Reconstruction Alignment Improves Unified Multimodal Models
September 8, 2025
Authors: Ji Xie, Trevor Darrell, Luke Zettlemoyer, XuDong Wang
cs.AI
Abstract
Unified multimodal models (UMMs) integrate visual understanding and generation
within a single architecture. However, conventional training relies on
image-text pairs (or sequences) whose captions are typically sparse and miss
fine-grained visual details--even when they use hundreds of words to describe a
simple image. We introduce Reconstruction Alignment (RecA), a
resource-efficient post-training method that leverages visual understanding
encoder embeddings as dense "text prompts," providing rich supervision without
captions. Concretely, RecA conditions a UMM on its own visual understanding
embeddings and optimizes it to reconstruct the input image with a
self-supervised reconstruction loss, thereby realigning understanding and
generation. Despite its simplicity, RecA is broadly applicable: across
autoregressive, masked-autoregressive, and diffusion-based UMMs, it
consistently improves generation and editing fidelity. With only 27 GPU-hours,
post-training with RecA substantially improves image generation performance on
GenEval (0.73 → 0.90) and DPGBench (80.93 → 88.15), while
also boosting editing benchmarks (ImgEdit 3.38 → 3.75, GEdit
6.94 → 7.25). Notably, RecA surpasses much larger open-source models
and applies broadly across diverse UMM architectures, establishing it as an
efficient and general post-training alignment strategy for UMMs.