Distilled Decoding 2: One-step Sampling of Image Auto-regressive Models with Conditional Score Distillation

Abstract

Image Auto-regressive (AR) models have emerged as a powerful paradigm for visual generative models. Despite their promising performance, they suffer from slow generation speed due to the large number of sampling steps required. Although Distilled Decoding 1 (DD1) was recently proposed to enable few-step sampling for image AR models, it still incurs significant performance degradation in the one-step setting, and it relies on a pre-defined mapping that limits its flexibility. In this work, we propose a new method, Distilled Decoding 2 (DD2), that further advances the feasibility of one-step sampling for image AR models. Unlike DD1, DD2 does not rely on a pre-defined mapping. We view the original AR model as a teacher model that provides the ground-truth conditional score in the latent embedding space at each token position. Based on this, we propose a novel conditional score distillation loss to train a one-step generator. Specifically, we train a separate network to predict the conditional score of the generated distribution and apply score distillation at every token position, conditioned on previous tokens. Experimental results show that DD2 enables one-step sampling for image AR models with a minimal FID increase from 3.40 to 5.43 on ImageNet-256. Compared to the strongest baseline, DD1, DD2 reduces the gap between one-step sampling and the original AR model by 67%, while also achieving up to a 12.3× training speed-up. DD2 takes a significant step toward the goal of one-step AR generation, opening up new possibilities for fast and high-quality AR modeling. Code is available at https://github.com/imagination-research/Distilled-Decoding-2.
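As an illustration of the idea of a conditional score at a token position, the following is a minimal toy sketch, not the authors' implementation. It assumes that the teacher's categorical output over a small codebook, after Gaussian noise of scale sigma is added to the token embeddings, defines a Gaussian mixture whose score has a closed form. The abstract states that the generator-side score is predicted by a separately trained network; here it is replaced by the same closed form so that the example stays self-contained. The codebook, the probabilities, and the function names `mixture_score` and `distillation_direction` are all hypothetical.

```python
import math


def mixture_score(x, probs, codebook, sigma):
    """Score of sum_i probs[i] * N(codebook[i], sigma^2 I), evaluated at x."""
    # Log of the unnormalised posterior weight of each codebook entry given x.
    logits = []
    entries = []
    for p, e in zip(probs, codebook):
        if p <= 0.0:
            continue  # zero-probability tokens contribute nothing
        sq_dist = sum((xi - ei) ** 2 for xi, ei in zip(x, e))
        logits.append(math.log(p) - sq_dist / (2.0 * sigma ** 2))
        entries.append(e)
    # Softmax with max-subtraction for numerical stability.
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Score = sum_i w_i(x) * (e_i - x) / sigma^2.
    return [
        sum(w * (e[d] - x[d]) for w, e in zip(weights, entries)) / sigma ** 2
        for d in range(len(x))
    ]


def distillation_direction(x, teacher_probs, generator_probs, codebook, sigma):
    """Generator-side score minus teacher-side score at x (toy stand-in)."""
    s_teacher = mixture_score(x, teacher_probs, codebook, sigma)
    s_generator = mixture_score(x, generator_probs, codebook, sigma)
    return [g - t for g, t in zip(s_generator, s_teacher)]


# Hypothetical values for a single token position.
codebook = [[-1.0, 0.0], [1.0, 0.0]]  # two token embeddings
teacher = [0.9, 0.1]                  # teacher p(token | previous tokens)
x = [0.2, 0.0]                        # a noised embedding at this position

matched = distillation_direction(x, teacher, teacher, codebook, sigma=0.5)
mismatched = distillation_direction(x, teacher, [0.1, 0.9], codebook, sigma=0.5)
print(matched)
print(mismatched)
```

In this toy setting the score difference vanishes when the generator's conditional distribution matches the teacher's and is nonzero otherwise, which is the kind of signal a score distillation loss could use at each position. How DD2 actually parameterizes the noise, the generator, and the score network is not specified in the abstract and is not reproduced here.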

