
Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing

August 8, 2025
Authors: Xu Wang, Chenkai Xu, Yijie Jin, Jiachun Jin, Hao Zhang, Zhijie Deng
cs.AI

Abstract

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs for text generation, with the potential to decode multiple tokens in a single iteration. However, none of the existing open-source dLLMs have achieved superior inference speed over AR LLMs of similar size. This paper breaks this barrier with a simple and effective strategy named discrete diffusion forcing (D2F). D2F equips dLLMs with two key capabilities: (1) block-wise autoregressive generation that enables KV-cache utilization; (2) prediction of subsequent tokens without waiting for prior blocks to complete, which enables inter-block parallel decoding. In this way, vanilla dLLMs are refashioned into an AR-diffusion hybrid paradigm for efficient inference. D2F can be implemented with an asymmetric distillation process based on pre-trained dLLMs. We further propose a pipelined parallel decoding algorithm, which enables a trade-off between efficiency and efficacy. Empirically, D2F dLLMs achieve more than 2.5× faster inference than LLaMA3 and Qwen2.5 on GSM8K. Compared to vanilla dLLMs such as LLaDA and Dream, the acceleration can exceed 50× while maintaining comparable output quality. The code is available at https://github.com/zhijie-group/Discrete-Diffusion-Forcing.
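
The core scheduling idea described in the abstract, block-wise generation with a pipelined window of partially decoded blocks, can be pictured with a toy sketch. The snippet below is a minimal illustration under stated assumptions, not the authors' released implementation: `denoise_step`, the block/step constants, and the scheduling rule are hypothetical placeholders for the actual dLLM refinement pass and D2F's distilled scheduler, and KV caching is only noted in comments rather than modeled.

```python
# Illustrative sketch (not the D2F codebase): later blocks start decoding
# before earlier blocks finish, and finished blocks are frozen in order.
import random

MASK = "<mask>"
BLOCK_SIZE = 4          # tokens per block (toy value)
NUM_BLOCKS = 3          # blocks to generate (toy value)
STEPS_PER_BLOCK = 4     # refinement iterations a block needs to finish
ADD_BLOCK_AFTER = 2     # open the next block once the newest one has this many steps

def denoise_step(block, step):
    """Toy stand-in for one dLLM refinement pass: reveal some masked positions.
    In D2F this would condition on the KV cache of earlier (frozen) blocks."""
    out = list(block)
    for i, tok in enumerate(out):
        if tok == MASK and random.random() < (step + 1) / STEPS_PER_BLOCK:
            out[i] = f"tok{random.randint(0, 99)}"
    return out

def pipelined_decode():
    blocks = []            # active blocks: {"tokens": [...], "step": int}
    finished = []
    opened = 0
    while len(finished) < NUM_BLOCKS:
        # Open a new block once the newest active block is partially decoded,
        # so later tokens are predicted without waiting for prior blocks.
        newest_ready = (not blocks) or blocks[-1]["step"] >= ADD_BLOCK_AFTER
        if opened < NUM_BLOCKS and newest_ready:
            blocks.append({"tokens": [MASK] * BLOCK_SIZE, "step": 0})
            opened += 1
        # One refinement pass over every active block (conceptually in parallel).
        for blk in blocks:
            blk["tokens"] = denoise_step(blk["tokens"], blk["step"])
            blk["step"] += 1
        # Freeze blocks that used up their step budget; in D2F their KV states
        # would be cached and reused by all later blocks.
        while blocks and blocks[0]["step"] >= STEPS_PER_BLOCK:
            finished.append(blocks.pop(0)["tokens"])
    return [t for blk in finished for t in blk]

if __name__ == "__main__":
    print(pipelined_decode())
```

Running the sketch shows two or more blocks being refined in the same iteration, which is the source of the speedup the abstract reports: throughput comes from overlapping blocks, while output quality rests on the asymmetric distillation that teaches the dLLM to predict later blocks from partially decoded earlier ones.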