DiSA: Diffusion Step Annealing in Autoregressive Image Generation

Abstract

An increasing number of autoregressive models, such as MAR, FlowAR, xAR, and Harmon, adopt diffusion sampling to improve the quality of image generation. However, this strategy leads to low inference efficiency, because diffusion usually takes 50 to 100 steps to sample a token. This paper explores how to address this issue effectively. Our key motivation is that as more tokens are generated during the autoregressive process, subsequent tokens follow more constrained distributions and are easier to sample. Intuitively, if a model has generated part of a dog, the remaining tokens must complete the dog and are thus more constrained. Empirical evidence supports this motivation: at later generation stages, the next tokens can be well predicted by a multilayer perceptron, exhibit low variance, and follow closer-to-straight-line denoising paths from noise to tokens. Based on this finding, we introduce diffusion step annealing (DiSA), a training-free method that gradually uses fewer diffusion steps as more tokens are generated, e.g., using 50 steps at the beginning and gradually decreasing to 5 steps at later stages. Because DiSA is derived from a finding specific to diffusion in autoregressive models, it is complementary to existing acceleration methods designed for diffusion alone. DiSA can be implemented in only a few lines of code on existing models and, albeit simple, achieves 5-10× faster inference for MAR and Harmon and 1.4-2.5× for FlowAR and xAR, while maintaining generation quality.
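To make the annealing idea concrete, the following is a minimal sketch of one possible step schedule, not the authors' implementation. The abstract states only that the step count decreases gradually from roughly 50 to 5; the linear interpolation used here, the function name `annealed_diffusion_steps`, and all parameter names are assumptions chosen for illustration.

```python
def annealed_diffusion_steps(ar_step: int, num_ar_steps: int,
                             start_steps: int = 50, end_steps: int = 5) -> int:
    """Return the number of diffusion steps to use at a given autoregressive step.

    Hypothetical linear schedule (an assumption, not taken from the paper):
    early autoregressive steps get `start_steps`, the final one gets `end_steps`.
    """
    if num_ar_steps < 1:
        raise ValueError("num_ar_steps must be at least 1")
    if not 0 <= ar_step < num_ar_steps:
        raise ValueError("ar_step must lie in [0, num_ar_steps)")
    if num_ar_steps == 1:
        # Only one autoregressive step: nothing to anneal.
        return start_steps
    # Fraction of the autoregressive process already completed, in [0, 1].
    progress = ar_step / (num_ar_steps - 1)
    return round(start_steps + (end_steps - start_steps) * progress)


# Example: step counts for a hypothetical 8-step autoregressive generation.
schedule = [annealed_diffusion_steps(i, 8) for i in range(8)]
print(schedule)
```

In an existing sampler, a function like this would replace the fixed diffusion step count inside the autoregressive loop, which is consistent with the claim that the method needs only a few lines of code. The loop itself depends on each model's codebase and is not shown here.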
