

Generation Enhances Understanding in Unified Multimodal Models via Multi-Representation Generation

January 29, 2026
Authors: Zihan Su, Hongyang Wei, Kangrui Cen, Yong Wang, Guanhua Chen, Chun Yuan, Xiangxiang Chu
cs.AI

Abstract

Unified Multimodal Models (UMMs) integrate both visual understanding and generation within a single framework. Their ultimate aspiration is to create a cycle where understanding and generation mutually reinforce each other. While recent post-training methods have successfully leveraged understanding to enhance generation, the reverse direction of utilizing generation to improve understanding remains largely unexplored. In this work, we propose UniMRG (Unified Multi-Representation Generation), a simple yet effective architecture-agnostic post-training method. UniMRG enhances the understanding capabilities of UMMs by incorporating auxiliary generation tasks. Specifically, we train UMMs to generate multiple intrinsic representations of input images, namely pixel (reconstruction), depth (geometry), and segmentation (structure), alongside standard visual understanding objectives. By synthesizing these diverse representations, UMMs capture complementary information regarding appearance, spatial relations, and structural layout. Consequently, UMMs develop a deeper and more comprehensive understanding of visual inputs. Extensive experiments across diverse UMM architectures demonstrate that our method notably enhances fine-grained perception, reduces hallucinations, and improves spatial understanding, while simultaneously boosting generation capabilities.
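To make the training recipe more concrete, the sketch below shows one plausible way the auxiliary generation objectives could be folded into a standard understanding loss during post-training. The model interface, loss choices, and the `aux_weight` coefficient are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
# Hypothetical sketch of UniMRG-style post-training: the output/target keys,
# loss functions, and weighting below are assumptions, not the paper's API.
import torch
import torch.nn.functional as F

def unimrg_loss(outputs, targets, aux_weight=0.1):
    """Combine the standard visual-understanding objective with auxiliary
    multi-representation generation losses (pixel, depth, segmentation)."""
    # Standard understanding objective, e.g. next-token cross-entropy on the answer.
    understanding = F.cross_entropy(
        outputs["answer_logits"].flatten(0, 1),
        targets["answer_ids"].flatten(),
    )
    # Auxiliary targets are intrinsic representations of the input image itself.
    pixel = F.mse_loss(outputs["pixel_pred"], targets["image"])          # reconstruction (appearance)
    depth = F.l1_loss(outputs["depth_pred"], targets["depth_map"])       # geometry
    seg = F.cross_entropy(outputs["seg_logits"], targets["seg_labels"])  # structure
    return understanding + aux_weight * (pixel + depth + seg)

# Toy usage with random tensors, just to show the expected shapes.
B, T, V, H, W, C = 2, 8, 1000, 32, 32, 21
outputs = {
    "answer_logits": torch.randn(B, T, V),
    "pixel_pred": torch.rand(B, 3, H, W),
    "depth_pred": torch.rand(B, 1, H, W),
    "seg_logits": torch.randn(B, C, H, W),
}
targets = {
    "answer_ids": torch.randint(0, V, (B, T)),
    "image": torch.rand(B, 3, H, W),
    "depth_map": torch.rand(B, 1, H, W),
    "seg_labels": torch.randint(0, C, (B, H, W)),
}
print(unimrg_loss(outputs, targets))
```

Because the auxiliary terms only add extra prediction heads and targets derived from the input image, this kind of objective is architecture-agnostic: any UMM that can condition on the image and emit the three representations can be post-trained this way, with `aux_weight` (a hypothetical knob here) balancing understanding against generation.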