Fast-dLLM v2: Efficient Block-Diffusion LLM

Abstract

Autoregressive (AR) large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks, yet their inherently sequential decoding limits inference efficiency. In this work, we propose Fast-dLLM v2, a carefully designed block diffusion language model (dLLM) that efficiently adapts pretrained AR models into dLLMs for parallel text generation, requiring only approximately 1B tokens of fine-tuning. This represents a 500x reduction in training data compared to full-attention diffusion LLMs such as Dream (580B tokens), while preserving the original model's performance. Our approach introduces a novel training recipe that combines a block diffusion mechanism with a complementary attention mask, enabling blockwise bidirectional context modeling without sacrificing AR training objectives. To further accelerate decoding, we design a hierarchical caching mechanism: a block-level cache that stores historical context representations across blocks, and a sub-block cache that enables efficient parallel generation within partially decoded blocks. Coupled with our parallel decoding pipeline, Fast-dLLM v2 achieves up to 2.5x speedup over standard AR decoding without compromising generation quality. Extensive experiments across diverse benchmarks demonstrate that Fast-dLLM v2 matches or surpasses AR baselines in accuracy while delivering state-of-the-art efficiency among dLLMs, marking a significant step toward the practical deployment of fast and accurate LLMs. Code and models will be publicly released.
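
One common way to obtain blockwise bidirectional context while keeping a left-to-right order across blocks is an attention mask that is bidirectional inside each block and causal between blocks. The sketch below illustrates only that general idea. The function name `block_causal_mask` and the block size used are hypothetical, and the abstract does not specify the exact mask Fast-dLLM v2 uses, so this should not be read as the authors' implementation.

```python
from typing import List


def block_causal_mask(seq_len: int, block_size: int) -> List[List[bool]]:
    """Build an illustrative block-causal attention mask.

    mask[i][j] is True when query position i may attend to key position j.
    Positions attend to every position in their own block (bidirectional)
    and to every position in earlier blocks (causal across blocks).
    """
    if seq_len < 0:
        raise ValueError("seq_len must be non-negative")
    if block_size <= 0:
        raise ValueError("block_size must be positive")

    return [
        # Key j is visible if its block index does not exceed the query's.
        [j // block_size <= i // block_size for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    # Hypothetical sizes chosen only for display: 6 tokens, blocks of 2.
    for row in block_causal_mask(6, 2):
        print("".join("x" if allowed else "." for allowed in row))
```

With a block size of 1 this reduces to the ordinary causal mask of an AR model, and with a block size equal to the sequence length it becomes full bidirectional attention, which is why a block size in between can be viewed as interpolating between the two regimes.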
