Semantic Generative Tuning for Unified Multimodal Models

Abstract

Unified multimodal models (UMMs) aim to consolidate visual understanding and visual generation within a single architecture. However, prevailing training paradigms optimize the two independently: understanding via sparse text signals, and generation via dense pixel objectives. This decoupled strategy yields misaligned representation spaces, isolating visual understanding from generation and hindering their mutual reinforcement. This work presents the first systematic investigation into generative post-training, in which we formulate hierarchical visual tasks as generative proxies to bridge this isolation in UMMs. Our empirical investigation reveals that high-level semantic tasks, particularly image segmentation, serve as the optimal proxies. Unlike low-level tasks, which distract models with texture details, segmentation provides structural semantics that significantly enhance both vision-centric perception and generative layout fidelity. Building on these insights, we introduce Semantic Generative Tuning (SGT), a novel paradigm that leverages segmentation as a generative proxy to align and synergize multimodal capabilities. Mechanistic analyses further demonstrate that SGT fundamentally improves the linear separability of features and optimizes visual-textual attention allocation patterns. Extensive evaluations show that SGT consistently improves both multimodal comprehension and generative fidelity across mainstream benchmarks. Our code is available at https://song2yu.github.io/SGT/.
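The abstract's claim about feature linear separability is the kind of property commonly measured with a linear probe: a linear classifier trained on frozen features, whose held-out accuracy serves as the separability score. The sketch below illustrates that general technique only. It is not the authors' analysis protocol, which the abstract does not describe. The function names `linear_probe_accuracy` and `synthetic_features` are hypothetical, and the features are synthetic Gaussian data rather than model activations.

```python
import numpy as np


def linear_probe_accuracy(features, labels, steps=500, lr=0.1, seed=0):
    """Fit a binary logistic-regression probe on frozen features.

    Returns held-out accuracy, used here as a proxy for linear separability.
    Hypothetical helper for illustration; not taken from the SGT code.
    """
    rng = np.random.default_rng(seed)
    n = features.shape[0]
    order = rng.permutation(n)
    split = int(0.8 * n)
    train, test = order[:split], order[split:]

    # Standardize with training statistics only, to avoid test leakage.
    mean = features[train].mean(axis=0)
    std = features[train].std(axis=0) + 1e-8
    x_train = (features[train] - mean) / std
    x_test = (features[test] - mean) / std
    y_train = labels[train].astype(float)

    # Plain full-batch gradient descent on the logistic loss.
    w = np.zeros(features.shape[1])
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(x_train @ w + b)))
        grad = probs - y_train
        w -= lr * (x_train.T @ grad) / len(train)
        b -= lr * grad.mean()

    preds = (x_test @ w + b) > 0
    return float((preds == labels[test].astype(bool)).mean())


def synthetic_features(separation, n=400, dim=16, seed=0):
    """Synthetic stand-in for model features (an assumption, not real data).

    The two classes differ only by a shift of size `separation` along axis 0.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    direction = np.zeros(dim)
    direction[0] = 1.0
    features = rng.normal(size=(n, dim)) + separation * np.outer(labels - 0.5, direction)
    return features, labels


if __name__ == "__main__":
    for sep in (0.2, 4.0):
        feats, labs = synthetic_features(sep)
        print(f"separation={sep}: probe accuracy={linear_probe_accuracy(feats, labs):.3f}")
```

In an actual comparison, the synthetic arrays would be replaced by features extracted from the same layer of a model before and after tuning, with the probe settings held fixed so that any accuracy difference reflects the features rather than the probe.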