dParallel: Learnable Parallel Decoding for dLLMs
September 30, 2025
Authors: Zigeng Chen, Gongfan Fang, Xinyin Ma, Ruonan Yu, Xinchao Wang
cs.AI
Abstract
Diffusion large language models (dLLMs) have recently drawn considerable
attention within the research community as a promising alternative to
autoregressive generation, offering parallel token prediction and lower
inference latency. Yet, their parallel decoding potential remains largely
underexplored, as existing open-source models still require nearly as many
decoding steps as there are generated tokens to preserve performance. To
address this, we introduce dParallel,
a simple and effective method that unlocks the inherent parallelism of dLLMs
for fast sampling. We identify that the key bottleneck to parallel decoding
is the sequential convergence of certainty on masked tokens: the model grows
confident about masked tokens one after another rather than all at once.
Building on
this insight, we introduce the core of our approach: certainty-forcing
distillation, a novel training strategy that distills the model to follow its
original sampling trajectories while forcing it to reach high certainty on
masked tokens more rapidly and in parallel. Extensive experiments across
various benchmarks demonstrate that our method can dramatically reduce the
number of decoding steps while maintaining performance. When applied to the
LLaDA-8B-Instruct model, dParallel reduces decoding steps from 256 to 30 on
GSM8K, achieving an 8.5x speedup without performance degradation. On the MBPP
benchmark, it cuts decoding steps from 256 to 24, resulting in a 10.5x speedup
while maintaining accuracy. Our code is available at
https://github.com/czg1225/dParallel.
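
For intuition, the parallelism that dLLMs expose can be realized with a confidence-thresholded sampler: at every step, all masked positions whose predicted certainty clears a threshold are committed at once. The sketch below is a hedged illustration of this idea, not the released dParallel code; `model` is assumed to be a HuggingFace-style masked diffusion LM whose forward pass returns `.logits`, and `mask_id`, `threshold`, and `max_steps` are illustrative parameters.

```python
import torch

@torch.no_grad()
def parallel_decode(model, prompt_ids, mask_id, gen_len=256,
                    threshold=0.9, max_steps=256):
    """Confidence-thresholded parallel decoding (illustrative sketch).

    At each step, every masked position whose top-1 probability clears
    `threshold` is unmasked at once; if none qualifies, the single most
    certain position is unmasked so the loop always makes progress.
    Batch size 1 is assumed for the fallback branch.
    """
    device = prompt_ids.device
    gen = torch.full((prompt_ids.size(0), gen_len), mask_id,
                     dtype=prompt_ids.dtype, device=device)
    seq = torch.cat([prompt_ids, gen], dim=1)
    for _ in range(max_steps):
        masked = seq == mask_id
        if not masked.any():                    # everything decoded
            break
        probs = model(seq).logits.softmax(dim=-1)
        certainty, pred = probs.max(dim=-1)     # top-1 confidence per slot
        certainty = certainty.masked_fill(~masked, -1.0)
        accept = certainty >= threshold         # commit all confident slots
        if not accept.any():                    # fallback: most certain slot
            accept = certainty == certainty.max()
        seq[accept] = pred[accept]
    return seq
```

Because every accepted token is committed in the same forward pass, the number of model calls scales with the number of steps rather than the number of generated tokens, which is where the reported 8.5x and 10.5x speedups come from.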
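Likewise, a certainty-forcing objective in the spirit described above might pair a trajectory-following term with an entropy penalty that pushes masked-token distributions toward high certainty early. The formulation below is an assumption-laden sketch, not the paper's exact loss; the weighting `lam` and all tensor names are hypothetical.

```python
import torch
import torch.nn.functional as F

def certainty_forcing_loss(student_logits, trajectory_targets, masked, lam=0.1):
    """Sketch of a certainty-forcing distillation objective.

    student_logits:     (B, L, V) student predictions.
    trajectory_targets: (B, L) token ids recorded from the teacher's own
                        sampling trajectory (the path being distilled).
    masked:             (B, L) bool mask of still-masked positions.
    """
    logits = student_logits[masked]      # (N, V) masked slots only
    targets = trajectory_targets[masked]  # (N,)
    # Trajectory-following term: match the teacher's sampled tokens.
    ce = F.cross_entropy(logits, targets)
    # Certainty-forcing term: minimize predictive entropy on masked tokens,
    # encouraging many positions to become confident in parallel.
    logp = F.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(dim=-1).mean()
    return ce + lam * entropy
```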