Today, we're announcing **Qwen3-Coder-Next**, an open-weight language model designed specifically for coding agents and local development. It features the following key enhancements:
- **Super Efficient with Significant Performance**: With only 3B activated parameters (80B total parameters), it achieves performance comparable to models with 10–20x more active parameters, making it highly cost-effective for agent deployment.
- **Advanced Agentic Capabilities**: Through an elaborate training recipe, it excels at long-horizon reasoning, complex tool usage, and recovery from execution failures, ensuring robust performance in dynamic coding tasks.
- **Versatile Integration with Real-World IDEs**: Its 256K context length, combined with adaptability to various scaffold templates, enables seamless integration with different CLI/IDE platforms (e.g., Claude Code, Qwen Code, Qoder, Kilo, Trae, and Cline), supporting diverse development environments.
## Model Overview

**Qwen3-Coder-Next** has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 80B in total and 3B activated
- Number of Linear Attention Heads: 32 for V and 16 for QK
- Head Dimension: 128
- Mixture of Experts:
  - Number of Experts: 512
  - Number of Activated Experts: 10
  - Number of Shared Experts: 1
  - Expert Intermediate Dimension: 512
- Context Length: 262,144 natively
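The MoE configuration above — 10 routed experts selected out of 512, plus 1 shared expert that is always active — is what keeps only about 3B of the 80B parameters active per token. The top-k routing idea can be sketched in plain Python; this is a minimal illustration, and the function names and renormalized-softmax gating here are generic assumptions, not the model's actual implementation:

```python
import math
import random

NUM_EXPERTS = 512   # routed experts
TOP_K = 10          # activated routed experts per token
NUM_SHARED = 1      # shared expert, always active

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_token(router_logits, top_k=TOP_K):
    """Pick the top-k routed experts for one token and
    renormalize their gate weights to sum to 1."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))

# simulate one token's router scores over all 512 experts
random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
routing = route_token(logits)

print("activated routed experts:", len(routing))                   # 10
print("experts active per token incl. shared:", len(routing) + NUM_SHARED)
```

Only the chosen experts' FFN weights participate in the forward pass for that token, which is why the per-token compute tracks the 3B activated parameters rather than the full 80B.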
NOTE: This model supports only non-thinking mode and does not generate `<think></think>` blocks in its output. Specifying `enable_thinking=False` is no longer required.
For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.
## Quickstart

We advise you to use the latest version of `transformers`.

The following code snippet illustrates how to use the model to generate content based on given inputs:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-Coder-Next"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Write a quick sort algorithm."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=65536
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
content = tokenizer.decode(output_ids, skip_special_tokens=True)

print("content:", content)
```
Note: If you encounter out-of-memory (OOM) issues, consider reducing the context length to a shorter value, such as `32768`.
For local use, applications such as Ollama, LM Studio, MLX-LM, llama.cpp, and KTransformers have also added support for Qwen3.
## Deployment

For deployment, you can use the latest `sglang` or `vllm` to create an OpenAI-compatible API endpoint.
### SGLang

SGLang is a fast serving framework for large language models and vision language models. It can be used to launch a server with an OpenAI-compatible API. `sglang>=0.5.8` is required for Qwen3-Coder-Next, which can be installed using:
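The install command itself is missing above; a typical pip invocation for that version constraint would be the following (the `[all]` extra is an assumption based on SGLang's usual install instructions, not taken from this document):

```shell
pip install "sglang[all]>=0.5.8"
```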
The following command can be used to create an API endpoint at `http://localhost:30000/v1` with a maximum context length of 256K tokens, using tensor parallelism on 4 GPUs.
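The launch command is missing above; a sketch of what it would look like with SGLang's standard server flags (the flag names are assumptions drawn from SGLang's CLI, not from this document) is:

```shell
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-Coder-Next \
    --port 30000 \
    --tp-size 4 \
    --context-length 262144
```

Here `262144` corresponds to the 256K maximum context mentioned above.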
The default context length is 256K. If the server fails to start, consider reducing the context length to a smaller value, e.g., `32768`.
### vLLM

vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs. It can be used to launch a server with an OpenAI-compatible API. `vllm>=0.15.0` is required for Qwen3-Coder-Next, which can be installed using:
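The install command is missing above; a plain pip invocation matching that version constraint would be:

```shell
pip install "vllm>=0.15.0"
```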
The following command can be used to create an API endpoint at `http://localhost:8000/v1` with a maximum context length of 256K tokens, using tensor parallelism on 4 GPUs.
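The serve command is missing above; a sketch using vLLM's standard `vllm serve` entry point (the flag names are vLLM's usual ones, assumed rather than taken from this document) is:

```shell
vllm serve Qwen/Qwen3-Coder-Next \
    --port 8000 \
    --tensor-parallel-size 4 \
    --max-model-len 262144
```

Once the server is running, any OpenAI-compatible client can be pointed at `http://localhost:8000/v1`.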