Planned Diffusion
October 20, 2025
Authors: Daniel Israel, Tian Jin, Ellie Cheng, Guy Van den Broeck, Aditya Grover, Suvinay Subramanian, Michael Carbin
cs.AI
Abstract
A central challenge in large language model inference is the trade-off
between generation speed and output quality. Autoregressive models produce
high-quality text but generate tokens sequentially. Diffusion models can
generate tokens in parallel but often need many iterations to match the same
quality. We propose planned diffusion, a hybrid method that combines the
strengths of both paradigms. Planned diffusion works in two stages: first, the
model creates a short autoregressive plan that breaks the output into smaller,
independent spans. Second, the model generates these spans simultaneously using
diffusion. This approach expands the speed-quality Pareto frontier and provides
a practical path to faster, high-quality text generation. On AlpacaEval, a
suite of 805 instruction-following prompts, planned diffusion achieves a
Pareto-optimal trade-off between quality and latency, delivering 1.27x to 1.81x
speedups over autoregressive generation with win-rate drops of only 0.87% to
5.4%, respectively. Our sensitivity analysis shows that the planning mechanism
of planned diffusion is minimal and reliable, and simple runtime knobs exist to
provide flexible control of the quality-latency trade-off.
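The two-stage decode described above can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: `plan_spans` and `diffuse_span` are hypothetical stand-ins for the short autoregressive planner and the diffusion decoder, and the thread pool merely models the fact that independent spans are generated concurrently.

```python
import concurrent.futures

def plan_spans(prompt):
    # Stage 1 (hypothetical): a short autoregressive plan that breaks the
    # response into smaller, independent spans. A real model would emit
    # these plan tokens sequentially; here we return fixed span names.
    return ["intro", "body", "conclusion"]

def diffuse_span(span_name):
    # Stage 2 (hypothetical): each span is filled in by a diffusion
    # decoder over several denoising iterations. Here we just return
    # placeholder text for the span.
    return f"<{span_name} text>"

def planned_diffusion(prompt):
    spans = plan_spans(prompt)                       # sequential, but short
    with concurrent.futures.ThreadPoolExecutor() as ex:
        outputs = list(ex.map(diffuse_span, spans))  # spans decoded in parallel
    return " ".join(outputs)
```

Because the plan is short, the sequential stage costs little; the bulk of the tokens are produced in parallel across spans, which is the source of the reported speedup.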