Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing

Abstract

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs for text generation, with the potential to decode multiple tokens in a single iteration. However, none of the existing open-source dLLMs has achieved faster inference than AR LLMs of similar size. This paper breaks this barrier with a simple and effective strategy named discrete diffusion forcing (D2F). D2F equips dLLMs with two key capabilities: (1) block-wise autoregressive generation, which enables KV cache utilization; (2) prediction of following tokens without requiring completion of prior blocks, which enables inter-block parallel decoding. In this way, vanilla dLLMs are refurbished into an AR-diffusion hybrid paradigm for efficient inference. D2F can be implemented with an asymmetric distillation process based on pre-trained dLLMs. We further propose a pipelined parallel decoding algorithm, which enables a trade-off between efficiency and efficacy. Empirically, D2F dLLMs achieve more than 2.5× the inference speed of LLaMA3 and Qwen2.5 on GSM8K. Compared to vanilla dLLMs like LLaDA and Dream, the acceleration can be more than 50× while maintaining comparable output quality. The code is available at https://github.com/zhijie-group/Discrete-Diffusion-Forcing.
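
The scheduling idea behind inter-block parallel decoding can be illustrated with a toy simulation. The sketch below is not the D2F implementation: it uses no model, fills positions with random token ids, and decodes a fixed number of tokens per block per iteration from left to right. The function name `pipelined_block_decode` and the parameters `add_threshold` and `tokens_per_step` are hypothetical, introduced only for illustration. It shows how starting the next block before the previous one is finished reduces the number of decoding iterations, and how the threshold acts as a knob between parallelism and strictly sequential block-wise decoding.

```python
import random

MASK = None  # stand-in for a position that has not been decoded yet


def pipelined_block_decode(num_blocks, block_size,
                           add_threshold=0.5, tokens_per_step=2, seed=0):
    """Toy schedule: count iterations needed to fill `num_blocks` blocks.

    Assumptions (not taken from the paper): each iteration decodes up to
    `tokens_per_step` masked positions in every unfinished block, and a new
    fully masked block is appended once the newest block is at least
    `add_threshold` decoded. add_threshold=1.0 gives sequential decoding.
    """
    assert 0.0 < add_threshold <= 1.0 and tokens_per_step >= 1
    rng = random.Random(seed)
    blocks = [[MASK] * block_size]  # start with one fully masked block
    steps = 0
    while True:
        steps += 1
        # One simulated forward pass: all unfinished blocks advance together.
        # In a real system, finished prefix blocks would be served from a KV cache.
        for block in blocks:
            masked = [i for i, tok in enumerate(block) if tok is MASK]
            for i in masked[:tokens_per_step]:
                block[i] = rng.randrange(100)  # fake "predicted" token id
        # Append the next block once the newest one is sufficiently decoded.
        newest = blocks[-1]
        decoded_fraction = sum(tok is not MASK for tok in newest) / block_size
        if len(blocks) < num_blocks and decoded_fraction >= add_threshold:
            blocks.append([MASK] * block_size)
        # Stop when every block exists and no masked position remains.
        if len(blocks) == num_blocks and all(
                tok is not MASK for block in blocks for tok in block):
            return steps, blocks


if __name__ == "__main__":
    pipelined_steps, _ = pipelined_block_decode(4, 8, add_threshold=0.5)
    sequential_steps, _ = pipelined_block_decode(4, 8, add_threshold=1.0)
    print("pipelined iterations:", pipelined_steps)
    print("sequential iterations:", sequential_steps)
```

In this toy setting, lowering `add_threshold` lets more blocks be in flight at once and reduces the iteration count. In an actual dLLM, a block started early would be predicted from an incomplete prefix, which is presumably where the trade-off between efficiency and efficacy mentioned in the abstract comes from.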
