Semantic Generative Tuning for Unified Multimodal Models
May 18, 2026
Authors: Songsong Yu, Yuxin Chen, Ying Shan, Yanwei Li
cs.AI
Abstract
Unified multimodal models (UMMs) strive to consolidate visual understanding and visual generation within a single architecture. However, prevailing training paradigms optimize the two capabilities independently: understanding via sparse text signals and generation via dense pixel objectives. This decoupled strategy yields misaligned representation spaces, isolating visual understanding from generation and hindering their mutual reinforcement. This work presents the first systematic investigation into generative post-training, in which we formulate hierarchical visual tasks as generative proxies to bridge this isolation in UMMs. Our empirical investigation reveals that high-level semantic tasks, particularly image segmentation, serve as optimal proxies. Unlike low-level tasks, which distract models with texture details, segmentation provides structural semantics that significantly enhance both vision-centric perception and generative layout fidelity. Building upon these insights, we introduce Semantic Generative Tuning (SGT), a novel paradigm that leverages segmentation as a generative proxy to align and synergize multimodal capabilities. Mechanistic analyses further demonstrate that SGT fundamentally improves the linear separability of features and optimizes visual-textual attention allocation patterns. Extensive evaluations show that SGT consistently improves both multimodal comprehension and generative fidelity across mainstream benchmarks. Our code is available at https://song2yu.github.io/SGT/.
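The abstract refers to feature linear separability without describing how it is measured. One common way to quantify it is a linear probe: fit a linear classifier on frozen features and report held-out accuracy. The sketch below illustrates that generic idea only; it is not the authors' analysis protocol. The function names (`linear_probe_accuracy`, `make_toy_features`), the ridge least-squares classifier, and the synthetic Gaussian features standing in for UMM hidden states are all assumptions made for illustration.

```python
import numpy as np


def linear_probe_accuracy(features, labels, train_fraction=0.8, ridge=1e-3, seed=0):
    """Held-out accuracy of a ridge least-squares linear classifier.

    Hypothetical helper: a generic proxy for linear separability,
    not the measurement used in the paper.
    """
    features = np.asarray(features, dtype=np.float64)
    labels = np.asarray(labels)
    classes = np.unique(labels)

    # Random train/test split.
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))
    n_train = int(train_fraction * len(features))
    train_idx, test_idx = order[:n_train], order[n_train:]

    # Standardize with training statistics only, then append a bias column.
    mean = features[train_idx].mean(axis=0)
    std = features[train_idx].std(axis=0) + 1e-8
    x_train = (features[train_idx] - mean) / std
    x_test = (features[test_idx] - mean) / std
    x_train = np.hstack([x_train, np.ones((len(x_train), 1))])
    x_test = np.hstack([x_test, np.ones((len(x_test), 1))])

    # One-hot targets and the closed-form ridge regression solution.
    targets = (labels[train_idx][:, None] == classes[None, :]).astype(np.float64)
    gram = x_train.T @ x_train + ridge * np.eye(x_train.shape[1])
    weights = np.linalg.solve(gram, x_train.T @ targets)

    # Predict the class with the largest linear score.
    predictions = classes[np.argmax(x_test @ weights, axis=1)]
    return float(np.mean(predictions == labels[test_idx]))


def make_toy_features(n=400, dim=16, shift=4.0, seed=1):
    """Synthetic stand-in for model features (assumption, not real UMM data).

    Returns labels, features whose class means differ by `shift`,
    and features that carry no label information at all.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    separable = rng.normal(size=(n, dim)) + shift * labels[:, None]
    entangled = rng.normal(size=(n, dim))
    return labels, separable, entangled


if __name__ == "__main__":
    labels, separable, entangled = make_toy_features()
    print("separable features:", linear_probe_accuracy(separable, labels))
    print("entangled features:", linear_probe_accuracy(entangled, labels))
```

In an actual study, the same probe would be applied to features extracted from a model before and after tuning, with higher held-out accuracy read as better linear separability. The choice of layer, labels, and classifier is left open here because the abstract does not specify them.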