
Reconstruction Alignment Improves Unified Multimodal Models

September 8, 2025
作者: Ji Xie, Trevor Darrell, Luke Zettlemoyer, XuDong Wang
cs.AI

Abstract

Unified multimodal models (UMMs) unify visual understanding and generation within a single architecture. However, conventional training relies on image-text pairs (or sequences) whose captions are typically sparse and miss fine-grained visual details, even when they use hundreds of words to describe a simple image. We introduce Reconstruction Alignment (RecA), a resource-efficient post-training method that leverages visual understanding encoder embeddings as dense "text prompts," providing rich supervision without captions. Concretely, RecA conditions a UMM on its own visual understanding embeddings and optimizes it to reconstruct the input image with a self-supervised reconstruction loss, thereby realigning understanding and generation. Despite its simplicity, RecA is broadly applicable: across autoregressive, masked-autoregressive, and diffusion-based UMMs, it consistently improves generation and editing fidelity. With only 27 GPU-hours, post-training with RecA substantially improves image generation performance on GenEval (0.73 → 0.90) and DPGBench (80.93 → 88.15), while also boosting editing benchmarks (ImgEdit 3.38 → 3.75, GEdit 6.94 → 7.25). Notably, RecA surpasses much larger open-source models and applies broadly across diverse UMM architectures, establishing it as an efficient and general post-training alignment strategy for UMMs.
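
The abstract's core recipe, conditioning the generator on the model's own visual-understanding embeddings and training with a self-supervised reconstruction loss, can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: `understanding_encoder`, `umm_generator`, `decode_image`, the frozen encoder, and the pixel-space MSE objective are all hypothetical stand-ins for whatever encoder, generative pathway, and reconstruction loss a particular UMM uses.

```python
# Minimal RecA-style post-training step (illustrative sketch only).
import torch
import torch.nn.functional as F

def reca_step(image, understanding_encoder, umm_generator, decode_image, optimizer):
    # 1. Encode the input image with the model's own visual-understanding encoder;
    #    the resulting embeddings act as a dense "text prompt" (no caption needed).
    #    Assumption: the understanding encoder is kept frozen here.
    with torch.no_grad():
        cond = understanding_encoder(image)

    # 2. Condition the UMM's generation pathway on those embeddings and
    #    reconstruct the input image.
    latent = umm_generator(condition=cond)
    recon = decode_image(latent)

    # 3. Self-supervised reconstruction loss realigns understanding and generation.
    #    A pixel-space MSE is used here purely for illustration.
    loss = F.mse_loss(recon, image)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice the reconstruction objective would presumably take the form native to the UMM's generative head (a diffusion noise-prediction loss, or a token-level cross-entropy for autoregressive and masked-autoregressive variants), since the abstract reports that RecA applies across all three families.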