Reconstruction Alignment Improves Unified Multimodal Models

September 8, 2025
Authors: Ji Xie, Trevor Darrell, Luke Zettlemoyer, XuDong Wang
cs.AI

Abstract

Unified multimodal models (UMMs) unify visual understanding and generation within a single architecture. However, conventional training relies on image-text pairs (or sequences) whose captions are typically sparse and miss fine-grained visual details, even when they use hundreds of words to describe a simple image. We introduce Reconstruction Alignment (RecA), a resource-efficient post-training method that leverages visual understanding encoder embeddings as dense "text prompts," providing rich supervision without captions. Concretely, RecA conditions a UMM on its own visual understanding embeddings and optimizes it to reconstruct the input image with a self-supervised reconstruction loss, thereby realigning understanding and generation. Despite its simplicity, RecA is broadly applicable: across autoregressive, masked-autoregressive, and diffusion-based UMMs, it consistently improves generation and editing fidelity. With only 27 GPU-hours, post-training with RecA substantially improves image generation performance on GenEval (0.73 → 0.90) and DPGBench (80.93 → 88.15), while also boosting editing benchmarks (ImgEdit 3.38 → 3.75, GEdit 6.94 → 7.25). Notably, RecA surpasses much larger open-source models and applies broadly across diverse UMM architectures, establishing it as an efficient and general post-training alignment strategy for UMMs.
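To make the core idea concrete, below is a minimal, hypothetical sketch of one RecA post-training step in PyTorch-style code. The module names (`understanding_encoder`, `generator`) and the pixel-level MSE objective are illustrative assumptions, not the authors' released implementation; the actual reconstruction objective depends on the UMM family (autoregressive, masked-autoregressive, or diffusion).

```python
import torch
import torch.nn.functional as F


def reca_step(umm, image, optimizer):
    """One RecA post-training step (hypothetical sketch).

    The UMM's own understanding encoder turns the input image into dense
    embeddings that stand in for a sparse text caption; the generative
    branch is then optimized to reconstruct that same image from the
    embeddings with a self-supervised reconstruction loss.
    """
    # Dense "text prompt": embeddings from the visual understanding encoder.
    with torch.no_grad():
        cond = umm.understanding_encoder(image)

    # Condition the generative branch on the understanding embeddings
    # and ask it to reproduce the input image.
    recon = umm.generator(condition=cond)

    # Self-supervised reconstruction loss (pixel-level MSE here purely for
    # illustration; no captions or labels are needed).
    loss = F.mse_loss(recon, image)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because no captions are involved, this kind of alignment can run as a short post-training pass over unlabeled images; the abstract reports roughly 27 GPU-hours for the reported gains.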