Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

Abstract

We introduce LlamaGen, a new family of image generation models that applies the original "next-token prediction" paradigm of large language models to the visual generation domain. It is an affirmative answer to the question of whether vanilla autoregressive models, e.g., Llama, without inductive biases on visual signals, can achieve state-of-the-art image generation performance if scaled properly. We reexamine the design space of image tokenizers, the scalability properties of image generation models, and the quality of their training data. The outcome of this exploration consists of: (1) An image tokenizer with a downsample ratio of 16, a reconstruction quality of 0.94 rFID, and a codebook usage of 97% on the ImageNet benchmark. (2) A series of class-conditional image generation models ranging from 111M to 3.1B parameters, achieving 2.18 FID on the ImageNet 256x256 benchmark and outperforming popular diffusion models such as LDM and DiT. (3) A text-conditional image generation model with 775M parameters, obtained from two-stage training on LAION-COCO and on images of high aesthetic quality, demonstrating competitive visual quality and text alignment. (4) A verification that LLM serving frameworks are effective at optimizing the inference speed of image generation models, achieving a 326% - 414% speedup. We release all models and code to support the open-source community of visual generation and multimodal foundation models.
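
To make the two central quantities above concrete, here is a minimal sketch, not the authors' implementation. It shows how a downsample ratio of 16 turns a 256x256 image into a 16x16 grid of discrete tokens, and how next-token prediction generates that grid one token at a time. The function names, the codebook size of 1024, the use of a leading class token, and the uniform stand-in model are all assumptions made for illustration; a real system would replace `dummy_next_token_probs` with a trained Transformer.

```python
import random


def num_tokens(image_size: int, downsample_ratio: int) -> int:
    """Number of discrete tokens for a square image: (size / ratio) ** 2."""
    if image_size % downsample_ratio != 0:
        raise ValueError("image_size must be divisible by downsample_ratio")
    side = image_size // downsample_ratio
    return side * side


def dummy_next_token_probs(prefix, codebook_size):
    """Hypothetical stand-in for a trained model: uniform over all codes."""
    return [1.0 / codebook_size] * codebook_size


def sample_image_tokens(class_label, length, codebook_size, rng):
    """Generate tokens one at a time, each conditioned on the prefix so far."""
    # Assumption: the class condition is placed first in the sequence.
    tokens = [class_label]
    for _ in range(length):
        probs = dummy_next_token_probs(tokens, codebook_size)
        next_token = rng.choices(range(codebook_size), weights=probs, k=1)[0]
        tokens.append(next_token)
    return tokens[1:]  # drop the conditioning entry, keep the image tokens


n = num_tokens(256, 16)  # 16 x 16 grid
tokens = sample_image_tokens(
    class_label=0, length=n, codebook_size=1024, rng=random.Random(0)
)
print(n, len(tokens))  # prints "256 256"
```

In a full pipeline the sampled token indices would be mapped back to pixels by the tokenizer's decoder, which this sketch leaves out.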
