Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing

August 8, 2025
Authors: Xu Wang, Chenkai Xu, Yijie Jin, Jiachun Jin, Hao Zhang, Zhijie Deng
cs.AI

Abstract

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs for text generation, with the potential to decode multiple tokens in a single iteration. However, none of the existing open-source dLLMs have achieved superior inference speed over AR LLMs of similar size. This paper breaks this barrier based on a simple and effective strategy named discrete diffusion forcing (D2F). D2F equips dLLMs with two key capabilities: (1) block-wise autoregressive generation to enable KV cache utilization; (2) prediction of following tokens without requiring completion of prior blocks, for inter-block parallel decoding. In this way, vanilla dLLMs are refurbished into an AR-diffusion hybrid paradigm for efficient inference. D2F can be implemented with an asymmetric distillation process based on pre-trained dLLMs. We further propose a pipelined parallel decoding algorithm, which enables a trade-off between efficiency and efficacy. Empirically, D2F dLLMs achieve more than 2.5× the inference speed of LLaMA3 and Qwen2.5 on GSM8K. Compared to vanilla dLLMs like LLaDA and Dream, the acceleration can exceed 50× while maintaining comparable output quality. The code is available at https://github.com/zhijie-group/Discrete-Diffusion-Forcing.
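
To make the pipelined, block-level parallel decoding idea concrete, here is a minimal, self-contained Python sketch. It is an illustration under stated assumptions, not the released D2F implementation: the confidence scores come from a toy random stand-in for the dLLM denoiser, and names such as `toy_denoiser`, `block_size`, `launch_ratio`, and `confidence_threshold` are hypothetical parameters chosen for clarity. The sketch shows the two capabilities the abstract describes: blocks are produced left-to-right (which is what makes a KV-cache-style cached prefix possible), and a new block may begin decoding before the previous one is finished, with all active blocks refined in parallel at each iteration.

```python
import random

MASK = None  # placeholder for a masked (not yet decoded) token


def toy_denoiser(blocks):
    """Toy stand-in for a dLLM forward pass (assumption, not the real model).

    For every masked position in every active block, return a
    (token, confidence) proposal. A real model would condition on the
    cached prefix (earlier blocks) and the partially decoded context.
    """
    proposals = {}
    for b_idx, block in enumerate(blocks):
        for t_idx, tok in enumerate(block):
            if tok is MASK:
                proposals[(b_idx, t_idx)] = (f"tok_{b_idx}_{t_idx}", random.random())
    return proposals


def d2f_style_decode(num_blocks=4, block_size=8,
                     launch_ratio=0.5, confidence_threshold=0.7,
                     seed=0):
    """Sketch of pipelined block-wise parallel decoding.

    - Blocks are generated left-to-right, mimicking block-wise AR generation.
    - A new block is launched once the latest block is more than
      `launch_ratio` complete, so several blocks are decoded in parallel.
    - Each iteration commits masked positions whose proposal confidence
      exceeds `confidence_threshold` (illustrative values throughout).
    """
    random.seed(seed)
    blocks = [[MASK] * block_size]  # start with the first block active
    step = 0
    while any(MASK in blk for blk in blocks) or len(blocks) < num_blocks:
        step += 1
        # Commit confident proposals in all active (unfinished) blocks.
        for (b, t), (token, conf) in toy_denoiser(blocks).items():
            if conf >= confidence_threshold:
                blocks[b][t] = token
        # Launch the next block early, before the current last block is done.
        last = blocks[-1]
        done_frac = sum(tok is not MASK for tok in last) / block_size
        if len(blocks) < num_blocks and done_frac >= launch_ratio:
            blocks.append([MASK] * block_size)
    return blocks, step


if __name__ == "__main__":
    decoded, steps = d2f_style_decode()
    print(f"decoded {len(decoded)} blocks in {steps} parallel refinement steps")
```

A naive block-wise scheme would finish each block before starting the next; the early-launch condition is what provides inter-block parallelism, and the confidence threshold is one plausible knob behind the efficiency-efficacy trade-off mentioned above.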