Steering Visual Generation in Unified Multimodal Models with Understanding Supervision

Abstract

Unified multimodal models are envisioned to bridge the gap between understanding and generation. Yet, to achieve competitive performance, state-of-the-art models adopt largely decoupled understanding and generation components. This design, while effective for individual tasks, weakens the connection required for mutual enhancement, leaving the potential synergy empirically uncertain. We propose to explicitly restore this synergy by introducing Understanding-Oriented Post-Training (UNO), a lightweight framework that treats understanding not only as a distinct task but also as a direct supervisory signal that steers generative representations. By incorporating objectives that encode semantic abstraction (captioning) and structural details (visual regression), we enable effective gradient flow from understanding to generation. Extensive experiments on image generation and editing demonstrate that understanding can serve as an effective catalyst for generation.
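
The abstract does not state how the captioning and visual regression objectives are combined with the generation objective. As a minimal sketch only, assuming a simple weighted sum, the idea of adding understanding-side supervision to a generation loss could be written as follows. The function names, the weights `w_cap` and `w_reg`, and the choice of cross-entropy and mean squared error are all illustrative assumptions, not details taken from the paper.

```python
import math


def cross_entropy(logits, target_index):
    """Cross-entropy for one token position (assumed captioning term)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]


def mse(pred, target):
    """Mean squared error between two feature vectors (assumed regression term)."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def combined_loss(gen_loss, caption_logits, caption_targets,
                  pred_feats, target_feats, w_cap=1.0, w_reg=1.0):
    """Hypothetical total objective: generation loss plus weighted
    understanding terms (captioning and visual regression)."""
    cap = sum(cross_entropy(l, t)
              for l, t in zip(caption_logits, caption_targets)) / len(caption_targets)
    reg = mse(pred_feats, target_feats)
    return gen_loss + w_cap * cap + w_reg * reg


# Toy values, purely illustrative.
total = combined_loss(
    gen_loss=0.5,
    caption_logits=[[0.0, 0.0]],   # one caption token, two-word vocabulary
    caption_targets=[0],
    pred_feats=[1.0, 2.0],
    target_feats=[1.0, 4.0],
    w_cap=1.0,
    w_reg=0.5,
)
```

In an actual training setup the three terms would be computed from shared model representations in an automatic differentiation framework, so that the understanding terms contribute gradients to the generative pathway. This plain-Python sketch only shows how the scalar objective might be assembled.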