dParallel: Learnable Parallel Decoding for dLLMs
September 30, 2025
Authors: Zigeng Chen, Gongfan Fang, Xinyin Ma, Ruonan Yu, Xinchao Wang
cs.AI
Abstract
Diffusion large language models (dLLMs) have recently drawn considerable
attention within the research community as a promising alternative to
autoregressive generation, offering parallel token prediction and lower
inference latency. Yet, their parallel decoding potential remains largely
underexplored, as existing open-source models still require nearly as many
decoding steps as there are generated tokens to preserve performance. To
address this, we introduce dParallel,
a simple and effective method that unlocks the inherent parallelism of dLLMs
for fast sampling. We identify that the key bottleneck to parallel decoding
is the sequential convergence of certainty on masked tokens: the model grows
confident about masked tokens one after another rather than all at once.
Building on
this insight, we introduce the core of our approach: certainty-forcing
distillation, a novel training strategy that distills the model to follow its
original sampling trajectories while forcing it to reach high certainty on
masked tokens more rapidly and in parallel. Extensive experiments across
various benchmarks demonstrate that our method can dramatically reduce the
number of decoding steps while maintaining performance. When applied to the
LLaDA-8B-Instruct model, dParallel reduces decoding steps from 256 to 30 on
GSM8K, achieving an 8.5x speedup without performance degradation. On the MBPP
benchmark, it cuts decoding steps from 256 to 24, resulting in a 10.5x speedup
while maintaining accuracy. Our code is available at
https://github.com/czg1225/dParallel.
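
For intuition, the parallelism that dLLMs expose can be realized with a confidence-thresholded sampler: at every step, all masked positions whose predicted certainty clears a threshold are committed at once. The sketch below is a hedged illustration of this idea, not the released dParallel code; `model` is assumed to be a HuggingFace-style masked diffusion LM whose forward pass returns `.logits`, and `mask_id`, `threshold`, and `max_steps` are illustrative parameters.

```python
import torch

@torch.no_grad()
def parallel_decode(model, prompt_ids, mask_id, gen_len=256,
                    threshold=0.9, max_steps=256):
    """Confidence-thresholded parallel decoding (illustrative sketch).

    At each step, every masked position whose top-1 probability clears
    `threshold` is unmasked at once; if none qualifies, the single most
    certain position is unmasked so the loop always makes progress.
    Batch size 1 is assumed for the fallback branch.
    """
    device = prompt_ids.device
    gen = torch.full((prompt_ids.size(0), gen_len), mask_id,
                     dtype=prompt_ids.dtype, device=device)
    seq = torch.cat([prompt_ids, gen], dim=1)
    for _ in range(max_steps):
        masked = seq == mask_id
        if not masked.any():                    # everything decoded
            break
        probs = model(seq).logits.softmax(dim=-1)
        certainty, pred = probs.max(dim=-1)     # top-1 confidence per slot
        certainty = certainty.masked_fill(~masked, -1.0)
        accept = certainty >= threshold         # commit all confident slots
        if not accept.any():                    # fallback: most certain slot
            accept = certainty == certainty.max()
        seq[accept] = pred[accept]
    return seq
```

Because every accepted token is committed in the same forward pass, the number of model calls scales with the number of steps rather than the number of generated tokens, which is where the reported 8.5x and 10.5x speedups come from.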
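Likewise, a certainty-forcing objective in the spirit described above might pair a trajectory-following term with an entropy penalty that pushes masked-token distributions toward high certainty early. The formulation below is an assumption-laden sketch, not the paper's exact loss; the weighting `lam` and all tensor names are hypothetical.

```python
import torch
import torch.nn.functional as F

def certainty_forcing_loss(student_logits, trajectory_targets, masked, lam=0.1):
    """Sketch of a certainty-forcing distillation objective.

    student_logits:     (B, L, V) student predictions.
    trajectory_targets: (B, L) token ids recorded from the teacher's own
                        sampling trajectory (the path being distilled).
    masked:             (B, L) bool mask of still-masked positions.
    """
    logits = student_logits[masked]      # (N, V) masked slots only
    targets = trajectory_targets[masked]  # (N,)
    # Trajectory-following term: match the teacher's sampled tokens.
    ce = F.cross_entropy(logits, targets)
    # Certainty-forcing term: minimize predictive entropy on masked tokens,
    # encouraging many positions to become confident in parallel.
    logp = F.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(dim=-1).mean()
    return ce + lam * entropy
```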