Fast-dLLM v2: Efficient Block-Diffusion LLM
September 30, 2025
Authors: Chengyue Wu, Hao Zhang, Shuchen Xue, Shizhe Diao, Yonggan Fu, Zhijian Liu, Pavlo Molchanov, Ping Luo, Song Han, Enze Xie
cs.AI
Abstract
Autoregressive (AR) large language models (LLMs) have achieved remarkable
performance across a wide range of natural language tasks, yet their inherent
sequential decoding limits inference efficiency. In this work, we propose
Fast-dLLM v2, a carefully designed block diffusion language model (dLLM) that
efficiently adapts pretrained AR models into dLLMs for parallel text
generation, requiring only approximately 1B tokens of fine-tuning. This
represents a 500x reduction in training data compared to full-attention
diffusion LLMs such as Dream (580B tokens), while preserving the original
model's performance. Our approach introduces a novel training recipe that
combines a block diffusion mechanism with a complementary attention mask,
enabling blockwise bidirectional context modeling without sacrificing AR
training objectives. To further accelerate decoding, we design a hierarchical
caching mechanism: a block-level cache that stores historical context
representations across blocks, and a sub-block cache that enables efficient
parallel generation within partially decoded blocks. Coupled with our parallel
decoding pipeline, Fast-dLLM v2 achieves up to 2.5x speedup over standard AR
decoding without compromising generation quality. Extensive experiments across
diverse benchmarks demonstrate that Fast-dLLM v2 matches or surpasses AR
baselines in accuracy, while delivering state-of-the-art efficiency among
dLLMs, marking a significant step toward the practical deployment of fast and
accurate LLMs. Code and model will be publicly released.
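
The training recipe hinges on an attention mask that is bidirectional inside
each block but causal across blocks, so that blockwise bidirectional context
is gained without giving up the AR objective. The abstract does not spell out
the construction; the following is a minimal sketch of the blockwise-causal
pattern such a mask induces (our illustration, with an assumed block_size
that tiles the sequence, not the paper's exact "complementary" mask recipe):

    import torch

    def blockwise_causal_mask(seq_len: int, block_size: int) -> torch.Tensor:
        # True = the query may attend to that key. A query in block b sees
        # every token in blocks 0..b: full bidirectional attention within
        # its own block, causal attention across earlier blocks.
        block_idx = torch.arange(seq_len) // block_size
        return block_idx.unsqueeze(1) >= block_idx.unsqueeze(0)

Note that block_size=1 recovers the standard causal mask of AR training,
which is consistent with the claim that the AR objective is preserved.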
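On the decoding side, finished blocks feed the block-level cache while the
sub-block cache serves the block currently being denoised, letting several
confident tokens be committed in parallel per step. As a self-contained toy
of such confidence-thresholded parallel decoding (a random stand-in replaces
the model, and the threshold and cache bookkeeping are our assumptions, not
the paper's implementation):

    import torch

    MASK_ID, VOCAB_SIZE = 0, 100

    def dummy_logits(ids: torch.Tensor) -> torch.Tensor:
        # Stand-in for the transformer forward pass; in the real model the
        # block-level and sub-block KV caches would avoid recomputing the
        # prompt and already-committed tokens here.
        return torch.randn(ids.numel(), VOCAB_SIZE)

    def decode_block(prefix: torch.Tensor, block_size=8, threshold=0.9):
        # Each block starts fully masked and is denoised iteratively.
        block = torch.full((block_size,), MASK_ID)
        while (block == MASK_ID).any():
            logits = dummy_logits(torch.cat([prefix, block]))[-block_size:]
            logits[:, MASK_ID] = float("-inf")   # never emit the mask token
            probs, toks = logits.softmax(-1).max(-1)
            masked = (block == MASK_ID).nonzero(as_tuple=True)[0]
            confident = masked[probs[masked] >= threshold]
            if confident.numel() == 0:           # guarantee progress
                confident = masked[probs[masked].argmax().unsqueeze(0)]
            block[confident] = toks[confident]   # commit in parallel
        return torch.cat([prefix, block])

    out = decode_block(torch.tensor([5, 7, 9]))

The speedup over AR decoding comes from committing multiple tokens per
forward pass whenever their confidence clears the threshold, while the
fallback to the single most confident token bounds the number of steps.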