Reconstruction Alignment Improves Unified Multimodal Models

Abstract

Unified multimodal models (UMMs) unify visual understanding and generation within a single architecture. However, conventional training relies on image-text pairs (or sequences) whose captions are typically sparse and miss fine-grained visual details, even when they use hundreds of words to describe a simple image. We introduce Reconstruction Alignment (RecA), a resource-efficient post-training method that uses visual understanding encoder embeddings as dense "text prompts," providing rich supervision without captions. Concretely, RecA conditions a UMM on its own visual understanding embeddings and optimizes it to reconstruct the input image with a self-supervised reconstruction loss, thereby realigning understanding and generation. Despite its simplicity, RecA is broadly applicable: across autoregressive, masked-autoregressive, and diffusion-based UMMs, it consistently improves generation and editing fidelity. With only 27 GPU-hours of post-training, RecA substantially improves image generation performance on GenEval (0.73 → 0.90) and DPGBench (80.93 → 88.15), and also improves results on editing benchmarks (ImgEdit 3.38 → 3.75, GEdit 6.94 → 7.25). Notably, RecA surpasses much larger open-source models, establishing it as an efficient and general post-training alignment strategy for UMMs.
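The training signal described in the abstract (condition the generator on the model's own understanding embedding of an image, then minimize a reconstruction loss against that same image) can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a fixed random linear projection as a stand-in for the understanding encoder, a single linear layer as the generator, mean squared error as the reconstruction loss, and plain per-sample gradient descent. All names (`encode`, `generate`, `reca_step`, `train`) and all hyperparameters are hypothetical.

```python
import random


def encode(image, w_enc):
    """Toy stand-in for the understanding encoder: a fixed linear projection."""
    return [sum(w * x for w, x in zip(row, image)) for row in w_enc]


def generate(embedding, w_gen):
    """Toy stand-in for the generator: a linear map from embedding to pixels."""
    return [sum(w * e for w, e in zip(row, embedding)) for row in w_gen]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def reca_step(image, w_enc, w_gen, lr):
    """One caption-free update: embed the image, reconstruct it, descend on MSE.

    Only the generator weights are updated here (an assumption of this toy).
    Returns the loss measured before the update.
    """
    emb = encode(image, w_enc)
    recon = generate(emb, w_gen)
    loss = mse(recon, image)
    n = len(image)
    for i in range(n):
        # d(MSE)/d(w_gen[i][j]) = (2 / n) * (recon[i] - image[i]) * emb[j]
        g = 2.0 * (recon[i] - image[i]) / n
        for j in range(len(emb)):
            w_gen[i][j] -= lr * g * emb[j]
    return loss


def train(steps=200, dim=8, emb_dim=8, lr=0.1, seed=0):
    """Run the toy loop on a few random 'images'; return the mean loss per epoch."""
    rng = random.Random(seed)
    w_enc = [[rng.gauss(0.0, 1.0) / dim ** 0.5 for _ in range(dim)]
             for _ in range(emb_dim)]
    w_gen = [[0.0] * emb_dim for _ in range(dim)]
    images = [[rng.random() for _ in range(dim)] for _ in range(4)]
    losses = []
    for _ in range(steps):
        total = 0.0
        for img in images:
            total += reca_step(img, w_enc, w_gen, lr)
        losses.append(total / len(images))
    return losses


if __name__ == "__main__":
    history = train()
    print("first epoch loss:", history[0])
    print("last epoch loss:", history[-1])
```

The point of the sketch is only that the supervision comes from the image itself via its embedding, with no caption anywhere in the loop. In an actual UMM the encoder, generator, and loss would be those of the model being post-trained.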
