Unleashing the Potential of Large Language Models for Text-to-Image Generation through Autoregressive Representation Alignment

Abstract

We present Autoregressive Representation Alignment (ARRA), a new training framework that unlocks globally coherent text-to-image generation in autoregressive LLMs without architectural changes. Unlike prior work that requires complex architectural redesigns, ARRA aligns LLM hidden states with visual representations from external visual foundation models via a global visual alignment loss and a hybrid token, <HYBNEXT>. This token enforces dual constraints, local next-token prediction and global semantic distillation, enabling LLMs to implicitly learn spatial and contextual coherence while retaining their original autoregressive paradigm. Extensive experiments validate ARRA's plug-and-play versatility. When training from text-generation-only LLMs or from random initialization, ARRA reduces FID (Fréchet Inception Distance) by 25.5% (MIMIC-CXR), 8.8% (DeepEyeNet), and 7.5% (ImageNet) for advanced autoregressive LLMs such as Chameleon and LlamaGen, all without framework modifications. For domain adaptation, ARRA aligns general-purpose LLMs with specialized models (e.g., BioMedCLIP), achieving an 18.6% FID reduction over direct fine-tuning on medical imaging (MIMIC-CXR). By demonstrating that training objective redesign, not just architectural innovation, can resolve cross-modal global coherence challenges, ARRA offers a complementary paradigm for advancing autoregressive models. Code and models will be released to advance autoregressive image generation.
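
The dual constraint described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact form of the loss, so everything below is an assumption made for illustration: the alignment term is taken to be a cosine distance, the two terms are combined with a hypothetical weight `align_weight`, and the hidden state at the <HYBNEXT> position is assumed to already have the same dimension as the external visual feature (in practice a projection layer would probably be needed). The function names are hypothetical and are not taken from the ARRA code release.

```python
import math


def softmax_cross_entropy(logits, target_index):
    """Next-token loss at one position: -log softmax(logits)[target_index]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_index]


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def arra_style_loss(token_logits, token_targets, hybnext_hidden,
                    visual_feature, align_weight=1.0):
    """Combine a local and a global term (assumed form, not the paper's).

    token_logits   : list of per-position logit lists (local constraint)
    token_targets  : list of target token indices, one per position
    hybnext_hidden : hidden state at the <HYBNEXT> position (assumed to be
                     already projected to the visual feature dimension)
    visual_feature : global feature from an external visual encoder
    align_weight   : hypothetical weight balancing the two terms
    """
    # Local constraint: mean next-token cross-entropy over all positions.
    ntp = sum(
        softmax_cross_entropy(l, t) for l, t in zip(token_logits, token_targets)
    ) / len(token_targets)
    # Global constraint: pull the <HYBNEXT> state toward the visual feature.
    align = cosine_distance(hybnext_hidden, visual_feature)
    return ntp + align_weight * align, ntp, align


# Toy usage with made-up numbers.
total, ntp, align = arra_style_loss(
    token_logits=[[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]],
    token_targets=[0, 2],
    hybnext_hidden=[0.3, 0.9, -0.2],
    visual_feature=[0.25, 1.0, -0.1],
    align_weight=0.5,
)
print(total, ntp, align)
```

Under these assumptions the alignment term touches only the training objective: the model still emits tokens one at a time, which is consistent with the abstract's claim that no architectural change is required.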
