Planned Diffusion

Abstract

A central challenge in large language model inference is the trade-off between generation speed and output quality. Autoregressive models produce high-quality text but generate tokens sequentially. Diffusion models can generate tokens in parallel but often need many iterations to reach the same quality. We propose planned diffusion, a hybrid method that combines the strengths of both paradigms. Planned diffusion works in two stages: first, the model creates a short autoregressive plan that breaks the output into smaller, independent spans; second, the model generates these spans simultaneously using diffusion. This approach expands the speed-quality Pareto frontier and provides a practical path to faster, high-quality text generation. On AlpacaEval, a suite of 805 instruction-following prompts, planned diffusion achieves a Pareto-optimal trade-off between quality and latency: a 1.27x to 1.81x speedup over autoregressive generation with only a 0.87% to 5.4% drop in win rate, respectively. Our sensitivity analysis shows that the planning mechanism of planned diffusion is minimal and reliable, and that simple runtime knobs provide flexible control over the quality-latency trade-off.
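The two-stage control flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `make_plan`, `toy_denoiser`, `diffuse_spans`, the `Span` fields, and the `steps` parameter are all hypothetical stand-ins, and no real language model is involved. The sketch only shows the shape of the idea: a short plan fixes the spans first, and every span is then filled in over a shared, fixed number of denoising steps.

```python
import math
import random
from dataclasses import dataclass

MASK = "<mask>"


@dataclass
class Span:
    topic: str   # hypothetical: short label the plan assigns to the span
    length: int  # hypothetical: number of tokens reserved for the span


def make_plan(prompt: str) -> list[Span]:
    """Stage 1 stand-in. A real model would decode this plan autoregressively;
    here it is hard-coded purely for illustration."""
    return [Span("definition", 4), Span("example", 6), Span("summary", 3)]


def toy_denoiser(span: Span, position: int) -> str:
    """Stand-in for a diffusion model's prediction at one masked position."""
    return f"{span.topic}_{position}"


def diffuse_spans(plan: list[Span], steps: int, seed: int = 0) -> list[list[str]]:
    """Stage 2 stand-in. Every span starts fully masked and is unmasked
    gradually; all spans are updated within the same step."""
    rng = random.Random(seed)
    spans = [[MASK] * s.length for s in plan]
    for step in range(steps):
        remaining_steps = steps - step
        for span, tokens in zip(plan, spans):
            masked = [i for i, tok in enumerate(tokens) if tok == MASK]
            if not masked:
                continue
            # Unmask an even share of what is left, so that the final
            # step always clears every remaining mask.
            k = math.ceil(len(masked) / remaining_steps)
            for i in rng.sample(masked, k):
                tokens[i] = toy_denoiser(span, i)
    return spans


if __name__ == "__main__":
    plan = make_plan("Explain planned diffusion.")
    for span, tokens in zip(plan, diffuse_spans(plan, steps=3)):
        print(span.topic, tokens)
```

In this toy, the outer loop runs `steps` times regardless of how many tokens the spans contain in total, whereas token-by-token decoding would need one step per token. The `steps` argument is an assumed example of a runtime knob; the abstract does not say which knobs the method actually exposes.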
