Reconstruction Alignment Improves Unified Multimodal Models
September 8, 2025
Authors: Ji Xie, Trevor Darrell, Luke Zettlemoyer, XuDong Wang
cs.AI
Abstract
Unified multimodal models (UMMs) integrate visual understanding and generation
within a single architecture. However, conventional training relies on
image-text pairs (or sequences) whose captions are typically sparse and miss
fine-grained visual details--even when they use hundreds of words to describe a
simple image. We introduce Reconstruction Alignment (RecA), a
resource-efficient post-training method that leverages visual understanding
encoder embeddings as dense "text prompts," providing rich supervision without
captions. Concretely, RecA conditions a UMM on its own visual understanding
embeddings and optimizes it to reconstruct the input image with a
self-supervised reconstruction loss, thereby realigning understanding and
generation. Despite its simplicity, RecA is broadly applicable: across
autoregressive, masked-autoregressive, and diffusion-based UMMs, it
consistently improves generation and editing fidelity. With only 27 GPU-hours,
post-training with RecA substantially improves image generation performance on
GenEval (0.73 → 0.90) and DPGBench (80.93 → 88.15), while
also boosting editing benchmarks (ImgEdit 3.38 → 3.75, GEdit
6.94 → 7.25). Notably, RecA surpasses much larger open-source models
and applies broadly across diverse UMM architectures, establishing it as an
efficient and general post-training alignment strategy for UMMs.