Steering Visual Generation in Unified Multimodal Models with Understanding Supervision
May 7, 2026
Authors: Zeyu Liu, Zanlin Ni, Yang Yue, Cheng Da, Huan Yang, Di Zhang, Kun Gai, Gao Huang
cs.AI
Abstract
Unified multimodal models are envisioned to bridge the gap between understanding and generation. Yet, to achieve competitive performance, state-of-the-art models adopt largely decoupled understanding and generation components. This design, while effective for individual tasks, weakens the connection required for mutual enhancement, leaving the potential synergy empirically uncertain. We propose to explicitly restore this synergy by introducing Understanding-Oriented Post-Training (UNO), a lightweight framework that treats understanding not only as a distinct task, but also as a direct supervisory signal to steer generative representations. By incorporating objectives that encode semantic abstraction (captioning) and structural details (visual regression), we enable effective gradient flow from understanding to generation. Extensive experiments on image generation and editing demonstrate that understanding can serve as an effective catalyst for generation.
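The abstract describes combining a generation objective with two understanding-derived objectives (captioning and visual regression) so that understanding gradients flow into the generative representation. A minimal sketch of such a combined objective is shown below; the function name, the weighting scheme, and the loss weights are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a combined post-training objective in the spirit of
# UNO: the generative pathway is supervised not only by its own generation
# loss, but also by a captioning loss (semantic abstraction) and a
# visual-regression loss (structural detail). All names and weights here are
# placeholders for illustration.

def uno_total_loss(gen_loss, caption_loss, regress_loss,
                   w_caption=0.5, w_regress=0.5):
    """Weighted sum of generation and understanding-derived losses.

    In an actual training loop these would be differentiable tensors, so
    backpropagating through the sum routes understanding gradients into the
    shared generative representation. The weights are illustrative.
    """
    return gen_loss + w_caption * caption_loss + w_regress * regress_loss

# Toy demonstration with scalar stand-ins for the three loss terms.
total = uno_total_loss(gen_loss=1.0, caption_loss=0.4, regress_loss=0.2)
print(total)  # 1.0 + 0.5*0.4 + 0.5*0.2 = 1.3
```

In a real implementation the three terms would be computed on shared intermediate representations, which is what lets the understanding objectives steer the generator rather than merely co-train alongside it.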