Fast-dLLM v2: Efficient Block-Diffusion LLM
September 30, 2025
Authors: Chengyue Wu, Hao Zhang, Shuchen Xue, Shizhe Diao, Yonggan Fu, Zhijian Liu, Pavlo Molchanov, Ping Luo, Song Han, Enze Xie
cs.AI
Abstract
Autoregressive (AR) large language models (LLMs) have achieved remarkable
performance across a wide range of natural language tasks, yet their inherent
sequential decoding limits inference efficiency. In this work, we propose
Fast-dLLM v2, a carefully designed block diffusion language model (dLLM) that
efficiently adapts pretrained AR models into dLLMs for parallel text
generation, requiring only approximately 1B tokens of fine-tuning. This
represents a 500x reduction in training data compared to full-attention
diffusion LLMs such as Dream (580B tokens), while preserving the original
model's performance. Our approach introduces a novel training recipe that
combines a block diffusion mechanism with a complementary attention mask,
enabling blockwise bidirectional context modeling without sacrificing AR
training objectives.
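As a rough illustration of the blockwise attention pattern described here, the sketch below builds a mask that is bidirectional within each block and causal across blocks. The function name, block size, and tensor layout are assumptions for exposition, not the paper's implementation.

```python
import torch

def block_attention_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """Boolean (seq_len, seq_len) mask; True means query i may attend to key j.

    Tokens see every position in their own block (bidirectional) and all
    positions in earlier blocks (causal at block granularity).
    """
    block = torch.arange(seq_len) // block_size      # block index per position
    return block.unsqueeze(1) >= block.unsqueeze(0)  # key block <= query block

# Example: 8 tokens in blocks of 4. Rows 0-3 attend only within block 0;
# rows 4-7 attend to all of block 0 plus every position in block 1.
print(block_attention_mask(8, 4).int())
```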
To further accelerate decoding, we design a hierarchical caching mechanism: a
block-level cache that stores historical context representations across
blocks, and a sub-block cache that enables efficient parallel generation
within partially decoded blocks.
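The two cache levels might be organized along the following lines. This is a minimal bookkeeping sketch under assumed names (`HierarchicalCache`, `commit`, `finish_block`); it shows how sub-block entries could be promoted to block-level context, not how the attention states are actually reused.

```python
from dataclasses import dataclass, field

@dataclass
class HierarchicalCache:
    """Sketch of a two-level KV cache: finished blocks are fixed context,
    while the current block's partially decoded tokens live in a sub-block
    cache that is promoted once the block is fully decoded."""
    block_kv: list = field(default_factory=list)      # KV of finished blocks
    sub_block_kv: dict = field(default_factory=dict)  # pos -> KV, current block

    def commit(self, pos, kv):
        # Cache a token finalized inside the current (partially decoded) block.
        self.sub_block_kv[pos] = kv

    def finish_block(self):
        # Block complete: its KV becomes permanent block-level context.
        self.block_kv.append(dict(sorted(self.sub_block_kv.items())))
        self.sub_block_kv.clear()
```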
Coupled with our parallel decoding pipeline, Fast-dLLM v2 achieves up to a
2.5x speedup over standard AR decoding without compromising generation
quality.
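The abstract does not spell out the parallel decoding rule. A common scheme in diffusion-LLM decoding (and the one used by Fast-dLLM v1) finalizes, at each step, every masked position whose top-1 confidence clears a threshold; the sketch below assumes that scheme, with `model`, `mask_id`, and `threshold` as placeholders rather than the paper's actual interface.

```python
import torch

@torch.no_grad()
def decode_block(model, block_ids, mask_id, threshold=0.9, max_steps=16):
    """Illustrative confidence-thresholded parallel decoding of one block.

    Each step finalizes every still-masked position whose top-1 probability
    exceeds `threshold`, falling back to the single most confident position
    so that every step makes progress. `model` is assumed to map the block's
    token ids to logits of shape (block_len, vocab_size)."""
    for _ in range(max_steps):
        masked = block_ids == mask_id
        if not masked.any():
            break                                   # block fully decoded
        probs = model(block_ids).softmax(dim=-1)    # (block_len, vocab_size)
        conf, pred = probs.max(dim=-1)              # per-position top-1
        accept = masked & (conf >= threshold)
        if not accept.any():                        # guarantee progress
            accept = torch.zeros_like(masked)
            accept[conf.masked_fill(~masked, -1.0).argmax()] = True
        block_ids[accept] = pred[accept]            # unmask accepted tokens
    return block_ids
```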
Extensive experiments across diverse benchmarks demonstrate that Fast-dLLM v2
matches or surpasses AR baselines in accuracy, while delivering
state-of-the-art efficiency among dLLMs, marking a significant step toward
the practical deployment of fast and accurate LLMs. Code and models will be
publicly released.