

Steering Visual Generation in Unified Multimodal Models with Understanding Supervision

May 7, 2026
作者: Zeyu Liu, Zanlin Ni, Yang Yue, Cheng Da, Huan Yang, Di Zhang, Kun Gai, Gao Huang
cs.AI

Abstract

Unified multimodal models are envisioned to bridge the gap between understanding and generation. Yet, to achieve competitive performance, state-of-the-art models adopt largely decoupled understanding and generation components. This design, while effective for individual tasks, weakens the connection required for mutual enhancement, leaving the potential synergy empirically uncertain. We propose to explicitly restore this synergy by introducing Understanding-Oriented Post-Training (UNO), a lightweight framework that treats understanding not only as a distinct task, but also as a direct supervisory signal to steer generative representations. By incorporating objectives that encode semantic abstraction (captioning) and structural details (visual regression), we enable effective gradient flow from understanding to generation. Extensive experiments on image generation and editing demonstrate that understanding can serve as an effective catalyst for generation.
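The abstract names three ingredients: a primary generation objective plus two understanding-derived supervision terms, captioning (semantic abstraction) and visual regression (structural detail). A minimal sketch of how such a combined objective might look is below; the function names, the specific loss forms, and the weighting scheme are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def softmax_xent(logits, targets):
    """Mean cross-entropy over a batch: logits (B, C), integer targets (B,)."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def uno_loss(gen_logits, gen_targets, cap_logits, cap_targets,
             pred_feats, target_feats, w_cap=0.5, w_reg=0.5):
    """Hypothetical UNO-style combined objective (weights w_cap, w_reg
    are illustrative; the paper's actual formulation may differ)."""
    # Primary generation objective, e.g. token prediction for image synthesis.
    loss_gen = softmax_xent(gen_logits, gen_targets)
    # Semantic abstraction: a captioning loss on the shared representations.
    loss_cap = softmax_xent(cap_logits, cap_targets)
    # Structural detail: regress features produced by an understanding encoder.
    loss_reg = np.mean((pred_feats - target_feats) ** 2)
    return loss_gen + w_cap * loss_cap + w_reg * loss_reg
```

Because all three terms are differentiable functions of the shared representations, gradients from the captioning and regression terms flow back into the generative pathway, which is the mechanism the abstract describes.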