UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?

Abstract

Unified multimodal models have recently demonstrated strong generative capabilities, yet whether and when generation improves understanding remains unclear. Existing benchmarks do not systematically examine the specific tasks in which generation facilitates understanding. To this end, we introduce UniG2U-Bench, a comprehensive benchmark that organizes generation-to-understanding (G2U) evaluation into 7 regimes and 30 subtasks requiring varying degrees of implicit or explicit visual transformation. Extensive evaluation of over 30 models reveals three core findings: 1) Unified models generally underperform their base Vision-Language Models (VLMs), and Generate-then-Answer (GtA) inference typically degrades performance relative to direct inference. 2) Consistent improvements emerge in spatial intelligence, visual illusion, and multi-round reasoning subtasks, where enhanced spatial and shape perception, as well as multi-step intermediate image states, prove beneficial. 3) Tasks with similar reasoning structures and models sharing architectures exhibit correlated behaviors, suggesting that generation-understanding coupling induces class-consistent inductive biases over tasks, pretraining data, and model architectures. These findings highlight the need for more diverse training data and novel paradigms to fully unlock the potential of unified multimodal modeling.
