ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding

Abstract

Autoregressive models (ARMs) are hindered by slow sequential inference. Masked diffusion models (MDMs) offer a parallel alternative, but they suffer from two critical drawbacks: high computational overhead, because they preclude Key-Value (KV) caching, and incoherent generation, which arises from learning dependencies over an intractable space of token combinations. To address these limitations, we introduce ReFusion, a novel masked diffusion model that achieves superior performance and efficiency by elevating parallel decoding from the token level to a higher slot level, where each slot is a fixed-length, contiguous sub-sequence. This is achieved through an iterative "plan-and-infill" decoding process: a diffusion-based planning step first identifies a set of weakly dependent slots, and an autoregressive infilling step then decodes these selected slots in parallel. The slot-based design simultaneously unlocks full KV cache reuse within a unified causal framework and reduces the learning complexity from the token combination space to a manageable slot-level permutation space. Extensive experiments on seven diverse benchmarks show that ReFusion not only surpasses prior MDMs by a wide margin, with 34% performance gains and a speedup of over 18× on average, but also bridges the performance gap to strong ARMs while maintaining a 2.33× average speedup.
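
The iterative plan-and-infill loop described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the function names (`plan_and_infill`, `toy_score`, `toy_next_token`), the confidence threshold, and the scoring rule are all hypothetical stand-ins, and no neural model is involved. The sketch only shows the control flow: score the still-masked slots, select a subset, then fill each selected slot left to right from the same shared context.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # placeholder for a token that has not been decoded yet

Slots = List[List[Optional[int]]]


def plan_and_infill(
    num_slots: int,
    slot_len: int,
    score_slot: Callable[[Slots, int], float],
    next_token: Callable[[List[int], int, int], int],
    threshold: float = 0.5,  # assumed selection rule, not taken from the paper
) -> Tuple[Slots, int]:
    """Toy plan-and-infill loop; returns (decoded slots, number of iterations)."""
    slots: Slots = [[MASK] * slot_len for _ in range(num_slots)]
    done = [False] * num_slots
    steps = 0
    while not all(done):
        steps += 1
        # Plan: score every still-masked slot and keep the confident ones.
        pending = [i for i in range(num_slots) if not done[i]]
        scores = {i: score_slot(slots, i) for i in pending}
        chosen = [i for i in pending if scores[i] >= threshold]
        if not chosen:  # guarantee progress: fall back to the best-scoring slot
            chosen = [max(pending, key=lambda i: scores[i])]
        # Infill: every chosen slot sees the same context (slots finished in
        # earlier iterations), so a batched model could decode them together.
        # Inside a slot, tokens are still produced left to right.
        context = [t for i in range(num_slots) if done[i] for t in slots[i]]
        for i in chosen:
            prefix = list(context)
            for j in range(slot_len):
                tok = next_token(prefix, i, j)
                slots[i][j] = tok
                prefix.append(tok)
        for i in chosen:
            done[i] = True
    return slots, steps


def toy_score(slots: Slots, i: int) -> float:
    # Hypothetical rule: even slots are ready at once; an odd slot becomes
    # ready only after its left neighbour has been decoded.
    if i % 2 == 0:
        return 1.0
    return 1.0 if slots[i - 1][0] is not MASK else 0.0


def toy_next_token(prefix: List[int], slot_idx: int, pos: int) -> int:
    # Hypothetical stand-in for a language model: a deterministic token id.
    return slot_idx * 10 + pos


decoded, iterations = plan_and_infill(4, 3, toy_score, toy_next_token)
```

With these toy functions, the four slots of length three are decoded in two iterations (slots 0 and 2 first, then slots 1 and 3). Each iteration still takes three left-to-right token steps inside a slot, so the toy needs six sequential token steps where plain token-by-token decoding would need twelve. The saving in a real system depends on how many slots the planner can select per iteration.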
