
Reconstruction Alignment Improves Unified Multimodal Models

September 8, 2025
作者: Ji Xie, Trevor Darrell, Luke Zettlemoyer, XuDong Wang
cs.AI

Abstract

Unified multimodal models (UMMs) unify visual understanding and generation within a single architecture. However, conventional training relies on image-text pairs (or sequences) whose captions are typically sparse and miss fine-grained visual details, even when they use hundreds of words to describe a simple image. We introduce Reconstruction Alignment (RecA), a resource-efficient post-training method that leverages visual understanding encoder embeddings as dense "text prompts," providing rich supervision without captions. Concretely, RecA conditions a UMM on its own visual understanding embeddings and optimizes it to reconstruct the input image with a self-supervised reconstruction loss, thereby realigning understanding and generation. Despite its simplicity, RecA is broadly applicable: across autoregressive, masked-autoregressive, and diffusion-based UMMs, it consistently improves generation and editing fidelity. With only 27 GPU-hours, post-training with RecA substantially improves image generation performance on GenEval (0.73 → 0.90) and DPGBench (80.93 → 88.15), while also boosting editing benchmarks (ImgEdit 3.38 → 3.75, GEdit 6.94 → 7.25). Notably, RecA surpasses much larger open-source models and applies broadly across diverse UMM architectures, establishing it as an efficient and general post-training alignment strategy for UMMs.
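
The abstract's core recipe, conditioning the generator on the model's own visual-understanding embeddings and training with a self-supervised reconstruction loss, can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: `understanding_encoder`, `umm_generator`, `decode_image`, the frozen encoder, and the pixel-space MSE objective are all hypothetical stand-ins for whatever encoder, generative pathway, and reconstruction loss a particular UMM uses.

```python
# Minimal RecA-style post-training step (illustrative sketch only).
import torch
import torch.nn.functional as F

def reca_step(image, understanding_encoder, umm_generator, decode_image, optimizer):
    # 1. Encode the input image with the model's own visual-understanding encoder;
    #    the resulting embeddings act as a dense "text prompt" (no caption needed).
    #    Assumption: the understanding encoder is kept frozen here.
    with torch.no_grad():
        cond = understanding_encoder(image)

    # 2. Condition the UMM's generation pathway on those embeddings and
    #    reconstruct the input image.
    latent = umm_generator(condition=cond)
    recon = decode_image(latent)

    # 3. Self-supervised reconstruction loss realigns understanding and generation.
    #    A pixel-space MSE is used here purely for illustration.
    loss = F.mse_loss(recon, image)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice the reconstruction objective would presumably take the form native to the UMM's generative head (a diffusion noise-prediction loss, or a token-level cross-entropy for autoregressive and masked-autoregressive variants), since the abstract reports that RecA applies across all three families.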