Unleashing the Potential of Large Language Models for Text-to-Image Generation through Autoregressive Representation Alignment
March 10, 2025
Authors: Xing Xie, Jiawei Liu, Ziyue Lin, Huijie Fan, Zhi Han, Yandong Tang, Liangqiong Qu
cs.AI
Abstract
We present Autoregressive Representation Alignment (ARRA), a new training
framework that unlocks globally coherent text-to-image generation in
autoregressive LLMs without architectural changes. Unlike prior work that
requires complex architectural redesigns, ARRA aligns LLM hidden states with
visual representations from external visual foundation models via a global
visual alignment loss and a hybrid token, <HYBNEXT>. This token enforces dual
constraints: local next-token prediction and global semantic distillation,
enabling LLMs to implicitly learn spatial and contextual coherence while
retaining their original autoregressive paradigm. Extensive experiments
validate ARRA's plug-and-play versatility. When training from
text-generation-only LLMs or random initialization, ARRA reduces FID by 25.5%
(MIMIC-CXR), 8.8% (DeepEyeNet), and 7.5% (ImageNet) for advanced autoregressive
LLMs like Chameleon and LlamaGen, all without framework modifications. For
domain adaptation, ARRA aligns general-purpose LLMs with specialized models
(e.g., BioMedCLIP), achieving an 18.6% FID reduction over direct fine-tuning on
medical imaging (MIMIC-CXR). By demonstrating that training objective redesign
-- not just architectural innovation -- can resolve cross-modal global
coherence challenges, ARRA offers a complementary paradigm for advancing
autoregressive models. Code and models will be released to advance
autoregressive image generation.
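The dual constraint imposed at the <HYBNEXT> token can be sketched as a combined objective: the standard next-token cross-entropy plus a cosine-distance distillation term that pulls a projection of the LLM's hidden state toward the external visual foundation model's global image embedding. The following is a minimal numpy sketch, not the authors' implementation; the projection `W`, the weight `lam`, and all dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy dimensions (hypothetical): LLM hidden size, vocab size, visual-feature size.
d_model, vocab, d_vis = 16, 32, 8

# Hidden state at the <HYBNEXT> position, LM-head logits, and the true next token.
h_hyb = rng.normal(size=d_model)
logits = rng.normal(size=vocab)          # stand-in for the LM head output
target_token = 7

# Local constraint: ordinary next-token cross-entropy (autoregressive paradigm kept).
l_ntp = -np.log(softmax(logits)[target_token])

# Global constraint: align a learnable projection of h_hyb with the external
# visual model's global image representation v_global (e.g. a CLIP/BioMedCLIP
# image embedding; here just random stand-ins).
W = rng.normal(size=(d_vis, d_model)) * 0.1   # assumed learnable projection
v_global = rng.normal(size=d_vis)
z = W @ h_hyb
cos = z @ v_global / (np.linalg.norm(z) * np.linalg.norm(v_global))
l_align = 1.0 - cos                            # cosine-distance distillation loss

lam = 0.5                                      # alignment weight (hypothetical)
l_total = l_ntp + lam * l_align
print(float(l_total))
```

Because only the training objective changes, this term can be dropped at inference time, which is consistent with the abstract's claim that ARRA requires no architectural modification.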