Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens

Abstract

Scaling up autoregressive models in vision has not proven as beneficial as in large language models. In this work, we investigate this scaling problem in the context of text-to-image generation, focusing on two critical factors: whether models use discrete or continuous tokens, and whether tokens are generated in a random or fixed raster order using BERT- or GPT-like transformer architectures. Our empirical results show that, while all models scale effectively in terms of validation loss, their evaluation performance (measured by FID, GenEval score, and visual quality) follows different trends. Models based on continuous tokens achieve significantly better visual quality than those using discrete tokens. Furthermore, the generation order and attention mechanisms significantly affect the GenEval score: random-order models achieve notably better GenEval scores than raster-order models. Inspired by these findings, we train Fluid, a random-order autoregressive model on continuous tokens. The Fluid 10.5B model achieves a new state-of-the-art zero-shot FID of 6.16 on MS-COCO 30K and an overall score of 0.69 on the GenEval benchmark. We hope our findings and results will encourage future efforts to further bridge the scaling gap between vision and language models.
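
To make the terms in the abstract concrete, the following is a minimal illustrative sketch of the difference between a fixed raster generation order and a random one, and between GPT-like (causal) and BERT-like (bidirectional) attention masks. It is not the Fluid implementation: the function names, the grid size, and the use of plain Python lists are all assumptions made for illustration, and the abstract does not specify how orders and attention types are paired.

```python
import random


def raster_order(height, width):
    """Token positions in fixed left-to-right, top-to-bottom order."""
    return [(row, col) for row in range(height) for col in range(width)]


def random_order(height, width, seed=None):
    """The same token positions, visited in a shuffled order."""
    positions = raster_order(height, width)
    rng = random.Random(seed)  # local generator, so the global state is untouched
    rng.shuffle(positions)
    return positions


def causal_mask(n):
    """GPT-like mask: step i may attend only to steps 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """BERT-like mask: every position may attend to every other position."""
    return [[True] * n for _ in range(n)]


if __name__ == "__main__":
    # Hypothetical 2x3 token grid, chosen only to keep the printout small.
    print("raster:", raster_order(2, 3))
    print("random:", random_order(2, 3, seed=0))
    print("causal mask row 0:", causal_mask(3)[0])
```

Both orders cover the same set of positions; only the sequence in which tokens are produced differs. That sequence, together with the attention mask, is what the abstract reports as affecting the GenEval score.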

