Diffusion Language Models Know the Answer Before Decoding

Abstract

Diffusion language models (DLMs) have recently emerged as an alternative to autoregressive approaches, offering parallel sequence generation and flexible token orders. However, their inference remains slower than that of autoregressive models, primarily due to the cost of bidirectional attention and the large number of refinement steps required for high-quality outputs. In this work, we highlight and leverage an overlooked property of DLMs, early answer convergence: in many cases, the correct answer can be internally identified by the halfway point of refinement, well before the final decoding step, under both semi-autoregressive and random remasking schedules. For example, on GSM8K and MMLU, up to 97% and 99% of instances, respectively, can be decoded correctly using only half of the refinement steps. Building on this observation, we introduce Prophet, a training-free fast decoding paradigm that enables early-commit decoding. Specifically, Prophet dynamically decides whether to continue refinement or to go "all-in" (i.e., decode all remaining tokens in one step), using the confidence gap between the top-2 prediction candidates as the criterion. It integrates seamlessly into existing DLM implementations, incurs negligible overhead, and requires no additional training. Empirical evaluations of LLaDA-8B and Dream-7B across multiple tasks show that Prophet reduces the number of decoding steps by up to 3.4x while preserving high generation quality. These results recast DLM decoding as a problem of when to stop sampling, and demonstrate that early decode convergence provides a simple yet powerful mechanism for accelerating DLM inference, complementary to existing speedup techniques. Our code is publicly available at https://github.com/pixeli99/Prophet.
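The early-commit criterion described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract states only that the gap between the top-2 prediction candidates drives the decision. The helper names (`top2_gap`, `should_commit`, `commit_all`), the use of probability-space gaps, the averaging over masked positions, and the fixed `threshold` are all assumptions made for illustration.

```python
import math
from typing import List, Optional, Sequence

MASK = None  # hypothetical marker for a position that is still masked


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one position's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top2_gap(logits: Sequence[float]) -> float:
    """Probability gap between the best and second-best candidate token."""
    probs = sorted(softmax(logits), reverse=True)
    return probs[0] - probs[1]


def should_commit(masked_logits: Sequence[Sequence[float]], threshold: float) -> bool:
    """Assumed rule: commit when the mean top-2 gap over masked positions
    reaches the threshold. The real criterion may aggregate differently."""
    if not masked_logits:
        return True  # nothing left to decode
    gaps = [top2_gap(row) for row in masked_logits]
    return sum(gaps) / len(gaps) >= threshold


def commit_all(tokens: Sequence[Optional[int]],
               logits: Sequence[Sequence[float]]) -> List[int]:
    """Go "all-in": fill every masked position with its argmax token."""
    return [
        max(range(len(row)), key=row.__getitem__) if tok is MASK else tok
        for tok, row in zip(tokens, logits)
    ]


# Toy example: one decoded token, two masked positions, vocabulary of size 3.
tokens = [5, MASK, MASK]
logits = [[0.0, 0.0, 0.0], [0.0, 8.0, 0.0], [9.0, 0.0, 0.0]]
masked_rows = [row for tok, row in zip(tokens, logits) if tok is MASK]
if should_commit(masked_rows, threshold=0.9):
    tokens = commit_all(tokens, logits)
```

In a real decoding loop, a check of this kind would run once per refinement step on the logits the model has already produced, which is consistent with the claim of negligible overhead. When the check fails, the loop would continue with the usual remasking schedule.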

