大規模言語モデルの可能性を解き放つ：自己回帰的表現アラインメントによるテキストから画像への生成

要旨

我々は、オートレグレッシブLLM（大規模言語モデル）において、アーキテクチャ変更を伴わずにグローバルに一貫したテキストから画像への生成を実現する新しいトレーニングフレームワーク「Autoregressive Representation Alignment (ARRA)」を提案する。従来の複雑なアーキテクチャ再設計を必要とする手法とは異なり、ARRAは外部の視覚基盤モデルからの視覚表現とLLMの隠れ状態を、グローバルな視覚アライメント損失とハイブリッドトークン<HYBNEXT>を用いて整合させる。このトークンは、局所的な次トークン予測とグローバルな意味蒸留という二重の制約を課すことで、LLMが空間的・文脈的一貫性を暗黙的に学習しつつ、元のオートレグレッシブパラダイムを維持することを可能にする。大規模な実験により、ARRAのプラグアンドプレイの汎用性が検証された。テキスト生成専用LLMやランダム初期化からのトレーニングにおいて、ARRAはChameleonやLlamaGenなどの先進的なオートレグレッシブLLMにおいて、フレームワーク変更なしにFIDをMIMIC-CXRで25.5%、DeepEyeNetで8.8%、ImageNetで7.5%削減した。ドメイン適応においては、汎用LLMをBioMedCLIPなどの専門モデルと整合させ、医療画像（MIMIC-CXR）における直接ファインチューニングと比較して18.6%のFID削減を達成した。アーキテクチャ革新だけでなく、トレーニング目的の再設計がクロスモーダルなグローバル一貫性の課題を解決できることを示すことで、ARRAはオートレグレッシブモデルの進化に向けた補完的なパラダイムを提供する。コードとモデルは、オートレグレッシブ画像生成の進展に向けて公開される予定である。

English

We present Autoregressive Representation Alignment (ARRA), a new training framework that unlocks global-coherent text-to-image generation in autoregressive LLMs without architectural changes. Unlike prior work that requires complex architectural redesigns, ARRA aligns LLM hidden states with visual representations from external visual foundational models via a global visual alignment loss and a hybrid token, <HYBNEXT>. This token enforces dual constraints: local next-token prediction and global semantic distillation, enabling LLMs to implicitly learn spatial and contextual coherence while retaining their original autoregressive paradigm. Extensive experiments validate ARRA's plug-and-play versatility. When training from text-generation-only LLMs or random initialization, ARRA reduces FID by 25.5% (MIMIC-CXR), 8.8% (DeepEyeNet), and 7.5% (ImageNet) for advanced autoregressive LLMs like Chameleon and LlamaGen, all without framework modifications. For domain adaption, ARRA aligns general-purpose LLMs with specialized models (e.g., BioMedCLIP), achieving an 18.6% FID reduction over direct fine-tuning on medical imaging (MIMIC-CXR). By demonstrating that training objective redesign -- not just architectural innovation -- can resolve cross-modal global coherence challenges, ARRA offers a complementary paradigm for advancing autoregressive models. Code and models will be released to advance autoregressive image generation.

大規模言語モデルの可能性を解き放つ：自己回帰的表現アラインメントによるテキストから画像への生成

Unleashing the Potential of Large Language Models for Text-to-Image Generation through Autoregressive Representation Alignment

要旨

Support