Semantic Generative Tuning for Unified Multimodal Models

Abstract

Unified multimodal models (UMMs) strive to consolidate visual understanding and visual generation within a single architecture. However, prevailing training paradigms independently optimize understanding via sparse text signals and generation through dense pixel objectives. Such a decoupled strategy yields misaligned representation spaces, isolating visual understanding from generation and hindering their mutual reinforcement. This work presents the first systematic investigation into generative post-training, where we formulate hierarchical visual tasks as generative proxies to bridge the isolation in UMMs. Our empirical investigation reveals that high-level semantic tasks, particularly image segmentation, serve as optimal proxies. Unlike low-level tasks that distract models with texture details, segmentation provides structural semantics that significantly enhance both vision-centric perception and generative layout fidelity. Building upon these insights, we introduce Semantic Generative Tuning (SGT), a novel paradigm that leverages segmentation as a generative proxy to align and synergize multimodal capabilities. Mechanistic analyses further demonstrate that SGT fundamentally improves the linear separability of features and optimizes visual-textual attention allocation patterns. Extensive evaluations show that SGT consistently improves both multimodal comprehension and generative fidelity across mainstream benchmarks. Our code is available at https://song2yu.github.io/SGT/.