Fast-dLLM v2: Efficient Block-Diffusion LLM
September 30, 2025
Authors: Chengyue Wu, Hao Zhang, Shuchen Xue, Shizhe Diao, Yonggan Fu, Zhijian Liu, Pavlo Molchanov, Ping Luo, Song Han, Enze Xie
cs.AI
Abstract
Autoregressive (AR) large language models (LLMs) have achieved remarkable
performance across a wide range of natural language tasks, yet their inherent
sequential decoding limits inference efficiency. In this work, we propose
Fast-dLLM v2, a carefully designed block diffusion language model (dLLM) that
efficiently adapts pretrained AR models into dLLMs for parallel text
generation, requiring only approximately 1B tokens of fine-tuning. This
represents a 500x reduction in training data compared to full-attention
diffusion LLMs such as Dream (580B tokens), while preserving the original
model's performance. Our approach introduces a novel training recipe that
combines a block diffusion mechanism with a complementary attention mask,
enabling blockwise bidirectional context modeling without sacrificing AR
training objectives.
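As a rough illustration of the blockwise attention pattern described here, the sketch below builds a mask that is bidirectional within each block and causal across blocks. The function name, block size, and tensor layout are assumptions for exposition, not the paper's implementation.

```python
import torch

def block_attention_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """Boolean (seq_len, seq_len) mask; True means query i may attend to key j.

    Tokens see every position in their own block (bidirectional) and all
    positions in earlier blocks (causal at block granularity).
    """
    block = torch.arange(seq_len) // block_size      # block index per position
    return block.unsqueeze(1) >= block.unsqueeze(0)  # key block <= query block

# Example: 8 tokens in blocks of 4. Rows 0-3 attend only within block 0;
# rows 4-7 attend to all of block 0 plus every position in block 1.
print(block_attention_mask(8, 4).int())
```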
To further accelerate decoding, we design a hierarchical caching mechanism: a
block-level cache that stores historical context representations across
blocks, and a sub-block cache that enables efficient parallel generation
within partially decoded blocks.
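The two cache levels might be organized along the following lines. This is a minimal bookkeeping sketch under assumed names (`HierarchicalCache`, `commit`, `finish_block`); it shows how sub-block entries could be promoted to block-level context, not how the attention states are actually reused.

```python
from dataclasses import dataclass, field

@dataclass
class HierarchicalCache:
    """Sketch of a two-level KV cache: finished blocks are fixed context,
    while the current block's partially decoded tokens live in a sub-block
    cache that is promoted once the block is fully decoded."""
    block_kv: list = field(default_factory=list)      # KV of finished blocks
    sub_block_kv: dict = field(default_factory=dict)  # pos -> KV, current block

    def commit(self, pos, kv):
        # Cache a token finalized inside the current (partially decoded) block.
        self.sub_block_kv[pos] = kv

    def finish_block(self):
        # Block complete: its KV becomes permanent block-level context.
        self.block_kv.append(dict(sorted(self.sub_block_kv.items())))
        self.sub_block_kv.clear()
```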
Coupled with our parallel decoding pipeline, Fast-dLLM v2 achieves up to a
2.5x speedup over standard AR decoding without compromising generation
quality.
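The abstract does not spell out the parallel decoding rule. A common scheme in diffusion-LLM decoding (and the one used by Fast-dLLM v1) finalizes, at each step, every masked position whose top-1 confidence clears a threshold; the sketch below assumes that scheme, with `model`, `mask_id`, and `threshold` as placeholders rather than the paper's actual interface.

```python
import torch

@torch.no_grad()
def decode_block(model, block_ids, mask_id, threshold=0.9, max_steps=16):
    """Illustrative confidence-thresholded parallel decoding of one block.

    Each step finalizes every still-masked position whose top-1 probability
    exceeds `threshold`, falling back to the single most confident position
    so that every step makes progress. `model` is assumed to map the block's
    token ids to logits of shape (block_len, vocab_size)."""
    for _ in range(max_steps):
        masked = block_ids == mask_id
        if not masked.any():
            break                                   # block fully decoded
        probs = model(block_ids).softmax(dim=-1)    # (block_len, vocab_size)
        conf, pred = probs.max(dim=-1)              # per-position top-1
        accept = masked & (conf >= threshold)
        if not accept.any():                        # guarantee progress
            accept = torch.zeros_like(masked)
            accept[conf.masked_fill(~masked, -1.0).argmax()] = True
        block_ids[accept] = pred[accept]            # unmask accepted tokens
    return block_ids
```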
Extensive experiments across diverse benchmarks demonstrate that Fast-dLLM v2
matches or surpasses AR baselines in accuracy, while delivering
state-of-the-art efficiency among dLLMs, marking a significant step toward
the practical deployment of fast and accurate LLMs. Code and models will be
publicly released.