End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer

Abstract

Autoregressive image modeling relies on visual tokenizers to compress images into compact latent representations. We design an end-to-end training pipeline that jointly optimizes reconstruction and generation, enabling direct supervision from generation results to the tokenizer. This contrasts with prior two-stage approaches that train tokenizers and generative models separately. We further investigate leveraging vision foundation models to improve 1D tokenizers for autoregressive modeling. Our autoregressive generative model achieves strong empirical results, including a state-of-the-art FID score of 1.48 without guidance on ImageNet 256×256 generation.
