arcee-ai / Trinity-Large-Thinking-NVFP4
https://huggingface.co/arcee-ai/Trinity-Large-Thinking-NVFP4

Total runs: 840 (0 in the last 24 hours, 7 in the last 7 days, 631 in the last 30 days)
Last updated: April 10, 2026
Task: text-generation

Arcee Trinity Large Thinking

Trinity-Large-Thinking-NVFP4

Introduction

Trinity-Large-Thinking is a reasoning-optimized variant of Arcee AI's Trinity-Large family — a 398B-parameter sparse Mixture-of-Experts (MoE) model with approximately 13B active parameters per token, post-trained with extended chain-of-thought reasoning and agentic RL.

This repository contains the NVFP4 quantized weights of Trinity-Large-Thinking for deployment on NVIDIA Blackwell GPUs.

For full model details, benchmarks, and usage guidance, see the main Trinity-Large-Thinking model card.

Quantization Details
  • Scheme: NVFP4 (nvfp4_experts_only: only the MoE expert weights are quantized; attention and dense layers remain in BF16)
  • Tool: NVIDIA ModelOpt
  • Calibration: 2048 samples, seq_length=4096
  • KV cache: Not quantized
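A rough sense of what experts-only NVFP4 buys in memory can be sketched with back-of-envelope arithmetic. The expert-parameter share and the ~4.5 bits/weight cost (4-bit values plus one FP8 scale amortized over a 16-element block) are assumptions for illustration, not figures from this model card:

```python
# Back-of-envelope memory estimate for the NVFP4 checkpoint.
# Assumptions (not from the model card): MoE expert weights make up ~95%
# of the 398B parameters, and NVFP4 costs ~4.5 bits/weight (4-bit values
# plus one FP8 scale per 16-element block).

TOTAL_PARAMS = 398e9
EXPERT_FRACTION = 0.95           # assumed share of params in MoE experts
BF16_BITS = 16
NVFP4_BITS = 4 + 8 / 16          # 4-bit weight + amortized FP8 block scale

expert_params = TOTAL_PARAMS * EXPERT_FRACTION
dense_params = TOTAL_PARAMS - expert_params

bf16_gb = TOTAL_PARAMS * BF16_BITS / 8 / 1e9
nvfp4_gb = (expert_params * NVFP4_BITS + dense_params * BF16_BITS) / 8 / 1e9

print(f"BF16 weights:  ~{bf16_gb:.0f} GB")   # full-precision baseline
print(f"NVFP4 weights: ~{nvfp4_gb:.0f} GB")  # experts quantized, rest BF16
```

Under these assumptions the checkpoint shrinks from roughly 800 GB to roughly 250 GB, which is what makes single-node 8-GPU deployment plausible.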
Usage

Inference has been tested on:
  • Hopper GPUs (via the Marlin backend) and a Blackwell B300 node
  • vLLM 0.18.0+
vLLM

Requires vLLM >= 0.18.0. Native FP4 compute requires Blackwell GPUs; older GPUs fall back to Marlin weight decompression automatically.

Blackwell GPUs (B200/B300/GB300): Docker (recommended)
docker run --runtime nvidia --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:v0.18.0-cu130 \
  arcee-ai/Trinity-Large-Thinking-NVFP4 \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --enable-reasoning \
  --reasoning-parser deepseek_r1 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
Hopper GPUs (H100/H200) and others
vllm serve arcee-ai/Trinity-Large-Thinking-NVFP4 \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --enable-reasoning \
  --reasoning-parser deepseek_r1 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder

Note (Blackwell pip installs): If you install vLLM via pip on Blackwell rather than using the Docker image, the native FP4 kernels may produce incorrect output due to package version mismatches. As a workaround, force the Marlin backend:

export VLLM_NVFP4_GEMM_BACKEND=marlin

vllm serve arcee-ai/Trinity-Large-Thinking-NVFP4 \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --moe-backend marlin \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --enable-reasoning \
  --reasoning-parser deepseek_r1 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder

Marlin decompresses FP4 weights to BF16 for compute, providing the full memory compression benefit but not native FP4 compute speedup. On Hopper GPUs (H100/H200), Marlin is selected automatically and no extra flags are needed.
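Once a server from any of the commands above is running, it speaks the OpenAI-compatible API on port 8000. A minimal request sketch follows; the payload shape is standard, and the send step is commented out since it assumes a live server on localhost:8000:

```python
# Build a chat-completions request for the vLLM server started above.
# Assumes the server is listening on localhost:8000; the actual HTTP call
# is commented out so the snippet stands alone.
import json

payload = {
    "model": "arcee-ai/Trinity-Large-Thinking-NVFP4",
    "messages": [
        {"role": "user", "content": "Explain MoE routing in one paragraph."}
    ],
    "max_tokens": 1024,
    "temperature": 0.3,
}

# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     body = json.loads(resp.read())
#     # With --enable-reasoning, vLLM returns the chain of thought in
#     # message["reasoning_content"] alongside the final message["content"].
#     print(body["choices"][0]["message"]["content"])

print(json.dumps(payload, indent=2))
```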

Transformers
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "arcee-ai/Trinity-Large-Thinking-NVFP4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True
)

messages = [{"role": "user", "content": "Who are you?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=4096, do_sample=True, temperature=0.3, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
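When generating through Transformers there is no reasoning parser, so the chain of thought arrives inline in the decoded text. Assuming the model wraps it in <think> tags (implied by the deepseek_r1 reasoning parser used in the vLLM commands above), a simple split recovers the final answer:

```python
# Separate chain-of-thought from the final answer in raw generated text.
# Assumption: reasoning is delimited by <think>...</think> tags, as implied
# by the deepseek_r1 reasoning parser configured for vLLM above.
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no tags are found."""
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not m:
        return "", text.strip()
    return m.group(1).strip(), text[m.end():].strip()

sample = "<think>The user asks who I am.</think>I am Trinity, a model by Arcee AI."
reasoning, answer = split_reasoning(sample)
print(answer)  # -> I am Trinity, a model by Arcee AI.
```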
API

Works out of the box on OpenRouter as arcee-ai/trinity-large-thinking.

License

Trinity-Large-Thinking-NVFP4 is released under the Apache License, Version 2.0 (https://choosealicense.com/licenses/apache-2.0).

Citation

If you use this model, please cite:

@misc{singh2026arceetrinity,
  title        = {Arcee Trinity Large Technical Report},
  author       = {Varun Singh and Lucas Krauss and Sami Jaghouar and Matej Sirovatka and Charles Goddard and Fares Obied and Jack Min Ong and Jannik Straube and Fern and Aria Harley and Conner Stewart and Colin Kealty and Maziyar Panahi and Simon Kirsten and Anushka Deshpande and Anneketh Vij and Arthur Bresnu and Pranav Veldurthi and Raghav Ravishankar and Hardik Bishnoi and DatologyAI Team and Arcee AI Team and Prime Intellect Team and Mark McQuade and Johannes Hagemann and Lucas Atkins},
  year         = {2026},
  eprint       = {2602.17004},
  archivePrefix= {arXiv},
  primaryClass = {cs.LG},
  doi          = {10.48550/arXiv.2602.17004},
  url          = {https://arxiv.org/abs/2602.17004}
}

