

Generation Enhances Understanding in Unified Multimodal Models via Multi-Representation Generation

January 29, 2026
Authors: Zihan Su, Hongyang Wei, Kangrui Cen, Yong Wang, Guanhua Chen, Chun Yuan, Xiangxiang Chu
cs.AI

Abstract

Unified Multimodal Models (UMMs) integrate both visual understanding and generation within a single framework. Their ultimate aspiration is to create a cycle where understanding and generation mutually reinforce each other. While recent post-training methods have successfully leveraged understanding to enhance generation, the reverse direction of utilizing generation to improve understanding remains largely unexplored. In this work, we propose UniMRG (Unified Multi-Representation Generation), a simple yet effective architecture-agnostic post-training method. UniMRG enhances the understanding capabilities of UMMs by incorporating auxiliary generation tasks. Specifically, we train UMMs to generate multiple intrinsic representations of input images, namely pixel (reconstruction), depth (geometry), and segmentation (structure), alongside standard visual understanding objectives. By synthesizing these diverse representations, UMMs capture complementary information regarding appearance, spatial relations, and structural layout. Consequently, UMMs develop a deeper and more comprehensive understanding of visual inputs. Extensive experiments across diverse UMM architectures demonstrate that our method notably enhances fine-grained perception, reduces hallucinations, and improves spatial understanding, while simultaneously boosting generation capabilities.
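To make the described recipe concrete, below is a minimal sketch of the post-training objective implied by the abstract: the standard visual-understanding loss is combined with auxiliary generation losses for the pixel (reconstruction), depth (geometry), and segmentation (structure) representations of the input image. The function name `unimrg_loss`, the lambda weights, and the equal default weighting are illustrative assumptions; the paper's actual loss formulation and balancing scheme are not given in the abstract.

```python
import torch

# Sketch of the UniMRG post-training objective as described in the abstract:
# the standard visual-understanding loss is augmented with auxiliary generation
# losses for pixel (reconstruction), depth (geometry), and segmentation
# (structure) representations of the input image.
# The lambda_* weights are hypothetical; the abstract does not specify how the
# terms are balanced.

def unimrg_loss(understanding_loss: torch.Tensor,
                pixel_loss: torch.Tensor,
                depth_loss: torch.Tensor,
                seg_loss: torch.Tensor,
                lambda_pixel: float = 1.0,
                lambda_depth: float = 1.0,
                lambda_seg: float = 1.0) -> torch.Tensor:
    """Combine the understanding objective with the three auxiliary
    multi-representation generation objectives."""
    return (understanding_loss
            + lambda_pixel * pixel_loss
            + lambda_depth * depth_loss
            + lambda_seg * seg_loss)


if __name__ == "__main__":
    # Dummy per-task losses standing in for the outputs of a UMM forward pass.
    losses = [torch.tensor(v, requires_grad=True) for v in (2.1, 0.8, 0.5, 0.9)]
    total = unimrg_loss(*losses)
    total.backward()  # gradients flow through all four objectives jointly
    print(f"total post-training loss: {total.item():.3f}")
```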