Randomized Autoregressive Visual Generation

Abstract

This paper presents Randomized AutoRegressive modeling (RAR) for visual generation, which sets a new state of the art on the image generation task while remaining fully compatible with language modeling frameworks. The proposed RAR is simple: during a standard autoregressive training process with a next-token prediction objective, the input sequence (typically in raster order) is randomly permuted into a different factorization order with probability r, where r starts at 1 and decays linearly to 0 over the course of training. This annealing training strategy enables the model to learn to maximize the expected likelihood over all factorization orders, and thus effectively improves its capability of modeling bidirectional contexts. Importantly, RAR preserves the integrity of the autoregressive modeling framework, ensuring full compatibility with language modeling while significantly improving performance in image generation. On the ImageNet-256 benchmark, RAR achieves an FID score of 1.48, not only surpassing prior state-of-the-art autoregressive image generators but also outperforming leading diffusion-based and masked transformer-based methods. Code and models will be made available at https://github.com/bytedance/1d-tokenizer
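The annealed permutation schedule described in the abstract can be illustrated with a minimal sketch. The function names (permute_probability, sample_order, reorder) are hypothetical and are not taken from the released code. The sketch assumes r is updated per training step, which the abstract does not specify. It covers only the ordering of the input tokens, not the model or the loss.

```python
import random
from typing import List, Sequence


def permute_probability(step: int, total_steps: int) -> float:
    """Assumed schedule: r decays linearly from 1.0 (first step) to 0.0 (last step)."""
    if total_steps <= 1:
        return 0.0
    clamped = min(max(step, 0), total_steps - 1)
    return 1.0 - clamped / (total_steps - 1)


def sample_order(num_tokens: int, r: float, rng: random.Random) -> List[int]:
    """With probability r return a random factorization order, else raster order."""
    order = list(range(num_tokens))  # raster order: 0, 1, ..., num_tokens - 1
    if rng.random() < r:             # rng.random() is in [0, 1)
        rng.shuffle(order)
    return order


def reorder(tokens: Sequence, order: Sequence[int]) -> list:
    """Arrange tokens in the sampled order before next-token prediction."""
    return [tokens[i] for i in order]


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = [f"t{i}" for i in range(8)]  # stand-in for image token ids
    total_steps = 5
    for step in range(total_steps):
        r = permute_probability(step, total_steps)
        order = sample_order(len(tokens), r, rng)
        print(step, round(r, 2), reorder(tokens, order))
```

Early in training almost every sequence is permuted. By the final step r reaches 0 and every sequence is in raster order, so the model ends training on the same ordering a standard autoregressive generator would use.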
