RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark

Abstract

The integration of visual understanding and generation into unified multimodal models represents a significant stride toward general-purpose AI. However, existing benchmarks leave a fundamental question unanswered: does this architectural unification actually enable synergistic interaction between the constituent capabilities? Existing evaluation paradigms, which primarily assess understanding and generation in isolation, cannot determine whether a unified model can leverage its understanding to enhance its generation, or use generative simulation to facilitate deeper comprehension. To address this gap, we introduce RealUnify, a benchmark specifically designed to evaluate bidirectional capability synergy. RealUnify comprises 1,000 meticulously human-annotated instances spanning 10 categories and 32 subtasks. It is structured around two core axes: 1) Understanding Enhances Generation, which requires reasoning (e.g., commonsense, logic) to guide image generation, and 2) Generation Enhances Understanding, which requires mental simulation or reconstruction (e.g., of transformed or disordered visual inputs) to solve reasoning tasks. A key contribution is our dual-evaluation protocol, which combines direct end-to-end assessment with a diagnostic stepwise evaluation that decomposes tasks into distinct understanding and generation phases. This protocol allows us to discern whether performance bottlenecks stem from deficiencies in core abilities or from a failure to integrate them. Through large-scale evaluations of 12 leading unified models and 6 specialized baselines, we find that current unified models still struggle to achieve effective synergy, indicating that architectural unification alone is insufficient. These results highlight the need for new training strategies and inductive biases to fully unlock the potential of unified modeling.
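The idea behind the dual-evaluation protocol can be illustrated with a minimal sketch. It is not the benchmark's scoring code: the function name `diagnose`, the `margin` threshold, and the two reading labels are hypothetical choices made for illustration. The sketch assumes each instance has already been judged correct or incorrect under both the direct end-to-end setting and the stepwise setting.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Diagnosis:
    direct_acc: float    # accuracy under direct end-to-end evaluation
    stepwise_acc: float  # accuracy when the task is split into phases
    gap: float           # stepwise_acc - direct_acc
    reading: str         # coarse, hypothetical interpretation of the gap


def diagnose(direct: Sequence[bool], stepwise: Sequence[bool],
             margin: float = 0.05) -> Diagnosis:
    """Compare per-instance outcomes from the two evaluation settings.

    `margin` is an assumed threshold, not a value taken from the paper.
    """
    if not direct or len(direct) != len(stepwise):
        raise ValueError("need two non-empty result lists of equal length")

    direct_acc = sum(direct) / len(direct)
    stepwise_acc = sum(stepwise) / len(stepwise)
    gap = stepwise_acc - direct_acc

    # If the model does clearly better once understanding and generation are
    # separated, the abilities exist but are not being combined end to end.
    # Otherwise any low score points to the core abilities themselves.
    reading = "integration gap" if gap > margin else "no integration gap"
    return Diagnosis(direct_acc, stepwise_acc, gap, reading)


if __name__ == "__main__":
    result = diagnose([True, False, False, False], [True, True, True, False])
    print(result)
```

Under these assumptions, a large positive gap suggests a failure to integrate capabilities, while low accuracy in both settings with no gap suggests that a core ability is missing.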
