Fast-dLLM v2: Efficient Block-Diffusion LLM
September 30, 2025
Authors: Chengyue Wu, Hao Zhang, Shuchen Xue, Shizhe Diao, Yonggan Fu, Zhijian Liu, Pavlo Molchanov, Ping Luo, Song Han, Enze Xie
cs.AI
Abstract
Autoregressive (AR) large language models (LLMs) have achieved remarkable
performance across a wide range of natural language tasks, yet their inherent
sequential decoding limits inference efficiency. In this work, we propose
Fast-dLLM v2, a carefully designed block diffusion language model (dLLM) that
efficiently adapts pretrained AR models into dLLMs for parallel text
generation, requiring only approximately 1B tokens of fine-tuning. This
represents a 500x reduction in training data compared to full-attention
diffusion LLMs such as Dream (580B tokens), while preserving the original
model's performance. Our approach introduces a novel training recipe that
combines a block diffusion mechanism with a complementary attention mask,
enabling blockwise bidirectional context modeling without sacrificing AR
training objectives. To further accelerate decoding, we design a hierarchical
caching mechanism: a block-level cache that stores historical context
representations across blocks, and a sub-block cache that enables efficient
parallel generation within partially decoded blocks. Coupled with our parallel
decoding pipeline, Fast-dLLM v2 achieves up to 2.5x speedup over standard AR
decoding without compromising generation quality. Extensive experiments across
diverse benchmarks demonstrate that Fast-dLLM v2 matches or surpasses AR
baselines in accuracy, while delivering state-of-the-art efficiency among
dLLMs, marking a significant step toward the practical deployment of fast and
accurate LLMs. Code and model will be publicly released.
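
The training recipe hinges on an attention mask that is bidirectional inside
each block but causal across blocks, so that blockwise bidirectional context
is gained without giving up the AR objective. The abstract does not spell out
the construction; the following is a minimal sketch of the blockwise-causal
pattern such a mask induces (our illustration, with an assumed block_size
that tiles the sequence, not the paper's exact "complementary" mask recipe):

    import torch

    def blockwise_causal_mask(seq_len: int, block_size: int) -> torch.Tensor:
        # True = the query may attend to that key. A query in block b sees
        # every token in blocks 0..b: full bidirectional attention within
        # its own block, causal attention across earlier blocks.
        block_idx = torch.arange(seq_len) // block_size
        return block_idx.unsqueeze(1) >= block_idx.unsqueeze(0)

Note that block_size=1 recovers the standard causal mask of AR training,
which is consistent with the claim that the AR objective is preserved.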
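On the decoding side, finished blocks feed the block-level cache while the
sub-block cache serves the block currently being denoised, letting several
confident tokens be committed in parallel per step. As a self-contained toy
of such confidence-thresholded parallel decoding (a random stand-in replaces
the model, and the threshold and cache bookkeeping are our assumptions, not
the paper's implementation):

    import torch

    MASK_ID, VOCAB_SIZE = 0, 100

    def dummy_logits(ids: torch.Tensor) -> torch.Tensor:
        # Stand-in for the transformer forward pass; in the real model the
        # block-level and sub-block KV caches would avoid recomputing the
        # prompt and already-committed tokens here.
        return torch.randn(ids.numel(), VOCAB_SIZE)

    def decode_block(prefix: torch.Tensor, block_size=8, threshold=0.9):
        # Each block starts fully masked and is denoised iteratively.
        block = torch.full((block_size,), MASK_ID)
        while (block == MASK_ID).any():
            logits = dummy_logits(torch.cat([prefix, block]))[-block_size:]
            logits[:, MASK_ID] = float("-inf")   # never emit the mask token
            probs, toks = logits.softmax(-1).max(-1)
            masked = (block == MASK_ID).nonzero(as_tuple=True)[0]
            confident = masked[probs[masked] >= threshold]
            if confident.numel() == 0:           # guarantee progress
                confident = masked[probs[masked].argmax().unsqueeze(0)]
            block[confident] = toks[confident]   # commit in parallel
        return torch.cat([prefix, block])

    out = decode_block(torch.tensor([5, 7, 9]))

The speedup over AR decoding comes from committing multiple tokens per
forward pass whenever their confidence clears the threshold, while the
fallback to the single most confident token bounds the number of steps.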