Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models

Abstract

Autoregressive (AR) models, long dominant in language generation, are increasingly applied to image synthesis but are often considered less competitive than diffusion-based models. A primary limitation is the large number of image tokens that AR models require, which constrains training and inference efficiency as well as image resolution. To address this, we present Token-Shuffle, a novel yet simple method that reduces the number of image tokens in the Transformer. Our key insight is the dimensional redundancy of visual vocabularies in Multimodal Large Language Models (MLLMs), where low-dimensional visual codes from the visual encoder are directly mapped to high-dimensional language vocabularies. Leveraging this, we consider two key operations: token-shuffle, which merges spatially local tokens along the channel dimension to decrease the number of input tokens, and token-unshuffle, which untangles the inferred tokens after the Transformer blocks to restore the spatial arrangement for output. Trained jointly with textual prompts, our strategy requires no additional pretrained text encoder and enables MLLMs to support extremely high-resolution image synthesis in a unified next-token prediction manner while maintaining efficient training and inference. For the first time, we push the boundary of AR text-to-image generation to a resolution of 2048x2048 with gratifying generation performance. On the GenAI-benchmark, our 2.7B model achieves an overall score of 0.77 on hard prompts, outperforming the AR model LlamaGen by 0.18 and the diffusion model LDM by 0.15. Extensive large-scale human evaluations also demonstrate our strong image generation ability in terms of text alignment, visual flaws, and visual appearance. We hope that Token-Shuffle can serve as a foundational design for efficient high-resolution image generation within MLLMs.
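The two operations named in the abstract can be illustrated with a small NumPy sketch. This covers only the spatial rearrangement and is not the authors' implementation: the function names, the window size `s`, and the row-major token layout are assumptions made for illustration, and any learned projection layers the actual method may apply around these steps are omitted.

```python
import numpy as np


def token_shuffle(tokens: np.ndarray, h: int, w: int, s: int = 2) -> np.ndarray:
    """Merge each s x s window of tokens into one token along the channel axis.

    tokens: (h * w, d) array laid out row-major over an h x w grid (assumed).
    Returns an array of shape ((h // s) * (w // s), s * s * d).
    """
    n, d = tokens.shape
    assert n == h * w and h % s == 0 and w % s == 0
    grid = tokens.reshape(h // s, s, w // s, s, d)
    grid = grid.transpose(0, 2, 1, 3, 4)  # (h/s, w/s, s, s, d)
    return grid.reshape((h // s) * (w // s), s * s * d)


def token_unshuffle(merged: np.ndarray, h: int, w: int, s: int = 2) -> np.ndarray:
    """Invert token_shuffle and restore the (h * w, d) spatial arrangement."""
    m, sd = merged.shape
    assert m == (h // s) * (w // s) and sd % (s * s) == 0
    d = sd // (s * s)
    grid = merged.reshape(h // s, w // s, s, s, d)
    grid = grid.transpose(0, 2, 1, 3, 4)  # (h/s, s, w/s, s, d)
    return grid.reshape(h * w, d)


# Toy example: a 4 x 4 grid of 3-dimensional tokens, merged in 2 x 2 windows.
x = np.arange(4 * 4 * 3).reshape(16, 3)
y = token_shuffle(x, h=4, w=4, s=2)    # 16 tokens become 4 tokens of width 12
x_back = token_unshuffle(y, h=4, w=4, s=2)
```

In this sketch a window size of `s` shortens the sequence seen by the Transformer by a factor of `s * s`, and the round trip through `token_unshuffle` recovers the original token grid exactly.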
