We introduce dParallel, a simple and effective method that unlocks the inherent parallelism of dLLMs for fast sampling. We identify that the key bottleneck to parallel decoding is the sequential convergence of certainty for masked tokens. Building on this insight, we introduce the core of our approach: certainty-forcing distillation, a novel training strategy that distills the model to follow its original sampling trajectories while forcing it to reach high certainty on masked tokens more rapidly and in parallel. Extensive experiments across various benchmarks demonstrate that our method dramatically reduces the number of decoding steps while maintaining performance. Applied to the LLaDA-8B-Instruct model, dParallel reduces decoding steps from 256 to 30 on GSM8K, achieving an 8.5x speedup without performance degradation. On the MBPP benchmark, it cuts decoding steps from 256 to 24, a 10.5x speedup while maintaining accuracy.
Overview of the proposed certainty-forcing distillation.
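Certainty-forcing distillation is described above only at a high level. As a rough reading of that description, one can picture the objective as a trajectory-matching cross-entropy plus a low-entropy (high-certainty) term over masked positions. The sketch below is a minimal illustration of that reading, not the released training code: the name certainty_forcing_loss, the weight lam, and the tensor shapes are all assumptions.

import torch
import torch.nn.functional as F

def certainty_forcing_loss(logits, trajectory_tokens, mask_positions, lam=1.0):
    # logits: (batch, seq_len, vocab) student predictions on a masked input.
    # trajectory_tokens: (batch, seq_len) tokens from the model's own
    # original sampling trajectory, used as distillation targets.
    # mask_positions: (batch, seq_len) boolean mask of still-masked positions.
    masked_logits = logits[mask_positions]        # (n_masked, vocab)
    targets = trajectory_tokens[mask_positions]   # (n_masked,)

    # Trajectory matching: keep the student on its original sampling path.
    ce = F.cross_entropy(masked_logits, targets)

    # Certainty forcing: penalize predictive entropy on all masked positions
    # at once, so certainty converges in parallel rather than sequentially,
    # token by token.
    probs = masked_logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1).mean()

    return ce + lam * entropy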
import torch
import types
from transformers import AutoModel, AutoTokenizer
model_path = "Zigeng/dParallel_Dream_7B_Instruct"
model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = model.to("cuda").eval()
# Patch dParallel's parallel-decoding routines (from the dParallel repo) onto the model
from model.generation_utils_semiar import DreamGenerationMixin
model.diffusion_generate = types.MethodType(DreamGenerationMixin.diffusion_generate, model)
model._sample = types.MethodType(DreamGenerationMixin._sample, model)
messages = [
    {"role": "user", "content": "Toulouse has twice as many sheep as Charleston. Charleston has 4 times as many sheep as Seattle. How many sheep do Toulouse, Charleston, and Seattle have together if Seattle has 20 sheep? Let's think step by step."}
]
# Build the prompt with the chat template and move the tensors to the GPU
inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", return_dict=True, add_generation_prompt=True
)
input_ids = inputs.input_ids.to(device="cuda")
attention_mask = inputs.attention_mask.to(device="cuda")
output, nfe = model.diffusion_generate(
    input_ids,
    attention_mask=attention_mask,
    max_new_tokens=256,
    output_history=False,
    return_dict_in_generate=True,
    steps=256,                # upper bound on decoding steps; the actual NFE is far lower
    temperature=0.,
    top_p=None,
    alg="entropy_threshold",  # decode all tokens whose entropy clears the threshold in parallel
    alg_temp=0.1,
    top_k=None,
    block_length=32,          # semi-autoregressive block size
    threshold=0.5,            # certainty threshold for parallel unmasking
)
generations = [
    tokenizer.decode(g[len(p):].tolist())  # decode only the newly generated tokens
    for p, g in zip(input_ids, output.sequences)
]
print(generations[0].split(tokenizer.eos_token)[0])
print("NFE:", nfe)
📖 Experimental Results
Results on LLaDA-8B-Instruct:
Results on Dream-7B-Instruct:
Better Speed-Accuracy Trade-off:
☀️ Acknowledgement
Our code builds on LLaDA, Dream, Fast-dLLM, and dKV-Cache, and we acknowledge these great works for laying the groundwork that made our approach possible.
Citation
If our research assists your work, please give us a star ⭐ or cite us using:
@article{chen2025dparallel,
  title={dParallel: Learnable Parallel Decoding for dLLMs},
  author={Chen, Zigeng and Fang, Gongfan and Ma, Xinyin and Yu, Ruonan and Wang, Xinchao},
  journal={arXiv preprint arXiv:2509.26488},
  year={2025}
}