dParallel: Learnable Parallel Decoding for dLLMs
September 30, 2025
Authors: Zigeng Chen, Gongfan Fang, Xinyin Ma, Ruonan Yu, Xinchao Wang
cs.AI
Abstract
Diffusion large language models (dLLMs) have recently drawn considerable
attention within the research community as a promising alternative to
autoregressive generation, offering parallel token prediction and lower
inference latency. Yet, their parallel decoding potential remains largely
underexplored: existing open-source models still require a number of decoding
steps close to the generation length to preserve performance. To address this,
we introduce dParallel,
a simple and effective method that unlocks the inherent parallelism of dLLMs
for fast sampling. We identify that the key bottleneck to parallel decoding
arises from the sequential convergence of certainty on masked tokens: the
model grows confident about masked tokens one after another rather than
simultaneously. Building on
this insight, we introduce the core of our approach: certainty-forcing
distillation, a novel training strategy that distills the model to follow its
original sampling trajectories while forcing it to reach high certainty on
masked tokens more rapidly and in parallel. Extensive experiments across
various benchmarks demonstrate that our method can dramatically reduce the
number of decoding steps while maintaining performance. When applied to the
LLaDA-8B-Instruct model, dParallel reduces decoding steps from 256 to 30 on
GSM8K, achieving an 8.5x speedup without performance degradation. On the MBPP
benchmark, it cuts decoding steps from 256 to 24, resulting in a 10.5x speedup
while maintaining accuracy. Our code is available at
https://github.com/czg1225/dParallel.
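
To make the speedups concrete, here is a minimal sketch of confidence-thresholded parallel decoding for a masked diffusion LLM, in the spirit of the abstract. It is an illustration only: the HF-style `model(...).logits` interface, the `MASK_ID` value, and the threshold `tau` are assumptions, not the released dParallel API. At each step, every masked position whose top-1 probability clears the threshold is committed at once, so a model distilled to reach certainty early finishes in far fewer steps than the sequence length.

```python
import torch

MASK_ID = 126336  # mask token id used by LLaDA (assumption; check the model config)

@torch.no_grad()
def parallel_decode(model, prompt_ids, gen_len=256, tau=0.9, max_steps=256):
    """Confidence-thresholded parallel unmasking (hypothetical sketch)."""
    # Start with every generated position masked.
    x = torch.cat([prompt_ids,
                   torch.full((gen_len,), MASK_ID,
                              dtype=torch.long, device=prompt_ids.device)])
    gen = slice(prompt_ids.numel(), x.numel())
    for _ in range(max_steps):
        masked = x[gen] == MASK_ID
        if not masked.any():
            break  # every position committed: decoding finished early
        logits = model(x.unsqueeze(0)).logits[0, gen]  # HF-style forward (assumption)
        conf, pred = logits.softmax(-1).max(-1)
        # Commit all masked positions whose certainty clears the threshold.
        accept = masked & (conf >= tau)
        if not accept.any():
            # Guarantee progress: commit the single most certain masked token.
            idx = torch.where(masked)[0]
            accept = torch.zeros_like(masked)
            accept[idx[conf[idx].argmax()]] = True
        x[gen][accept] = pred[accept]
    return x[gen]
```

The single-token fallback mirrors standard confidence-based dLLM samplers and bounds decoding at `gen_len` steps; the large step reductions reported above require that many positions clear the threshold simultaneously, which is exactly the behavior certainty-forcing distillation trains for.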
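
Likewise, the certainty-forcing objective described in the abstract can be read as a two-term loss: a trajectory term that makes the student reproduce the tokens the frozen teacher committed along its own sampling trajectory, plus an entropy penalty that pushes the student toward high certainty on masked positions. The sketch below is one plausible instantiation under that reading; the function name, the `lam` weight, and the exact loss shape are assumptions, not the paper's published recipe.

```python
import torch
import torch.nn.functional as F

def certainty_forcing_loss(student, x_masked, teacher_tokens, mask, lam=0.1):
    """One distillation step on a partially masked sequence (hypothetical).

    x_masked:       [B, L] input ids with masked positions set to the mask id
    teacher_tokens: [B, L] tokens the frozen teacher committed along its own
                    (slow, step-by-step) sampling trajectory
    mask:           [B, L] bool, True at masked positions
    """
    logp = student(x_masked).logits.log_softmax(-1)  # [B, L, V]
    # Trajectory term: reproduce the teacher's committed tokens on masked
    # slots, keeping the student on the original sampling trajectory.
    traj = F.nll_loss(logp[mask], teacher_tokens[mask])
    # Certainty term: penalize entropy on masked slots so the student becomes
    # confident about many tokens at once, enabling parallel commits.
    p = logp[mask].exp()
    entropy = -(p * logp[mask]).sum(-1).mean()
    return traj + lam * entropy
```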