Semantic Generative Tuning for Unified Multimodal Models
May 18, 2026
Authors: Songsong Yu, Yuxin Chen, Ying Shan, Yanwei Li
cs.AI
Abstract
Unified multimodal models (UMMs) aim to consolidate visual understanding and visual generation within a single architecture. However, prevailing training paradigms optimize understanding via sparse text signals and generation via dense pixel objectives, each independently of the other. This decoupled strategy yields misaligned representation spaces, isolating visual understanding from generation and hindering their mutual reinforcement. This work presents the first systematic investigation into generative post-training, in which we formulate hierarchical visual tasks as generative proxies to bridge this isolation in UMMs. Our empirical investigation reveals that high-level semantic tasks, particularly image segmentation, serve as optimal proxies. Unlike low-level tasks, which distract models with texture details, segmentation provides structural semantics that significantly enhance both vision-centric perception and generative layout fidelity. Building on these insights, we introduce Semantic Generative Tuning (SGT), a novel paradigm that leverages segmentation as a generative proxy to align and synergize multimodal capabilities. Mechanistic analyses further demonstrate that SGT fundamentally improves the linear separability of features and optimizes visual-textual attention allocation patterns. Extensive evaluations show that SGT consistently improves both multimodal comprehension and generative fidelity across mainstream benchmarks. Our code is available at https://song2yu.github.io/SGT/.
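The abstract reports that SGT improves the linear separability of features but does not describe how this is measured. A common way to quantify linear separability is a linear probe: a linear classifier is trained on frozen features and its held-out accuracy is reported. The sketch below illustrates that general idea only. It is not the authors' protocol; the synthetic Gaussian features stand in for model features, and the names `make_features`, `train_linear_probe`, and `probe_accuracy` are hypothetical.

```python
import math
import random


def make_features(separation, n_per_class=100, dim=4, seed=0):
    """Synthetic stand-in for frozen features (an assumption, not real model
    activations): two Gaussian classes whose means differ by `separation`
    along the first axis."""
    rng = random.Random(seed)
    data = []
    for label in (0, 1):
        for _ in range(n_per_class):
            vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            vec[0] += separation * (label - 0.5)  # means at -sep/2 and +sep/2
            data.append((vec, label))
    rng.shuffle(data)
    return data


def train_linear_probe(data, epochs=200, lr=0.1):
    """Logistic regression fitted with full-batch gradient descent."""
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    n = len(data)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for vec, label in data:
            z = sum(wi * xi for wi, xi in zip(w, vec)) + b
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            err = 1.0 / (1.0 + math.exp(-z)) - label
            for i, xi in enumerate(vec):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(data, train_fraction=0.5):
    """Train the probe on the first part of the data, score it on the rest."""
    split = int(len(data) * train_fraction)
    train, test = data[:split], data[split:]
    w, b = train_linear_probe(train)
    correct = 0
    for vec, label in test:
        predicted = sum(wi * xi for wi, xi in zip(w, vec)) + b > 0.0
        correct += int(predicted == bool(label))
    return correct / len(test)


if __name__ == "__main__":
    for separation in (0.5, 6.0):
        acc = probe_accuracy(make_features(separation))
        print(f"separation={separation}: probe accuracy={acc:.2f}")
```

Under this setup, a higher held-out probe accuracy indicates that the classes are easier to separate with a single hyperplane. Comparing probe accuracy on features from a model before and after tuning is one way such a claim could be checked, assuming the same data and probe settings are used for both.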