Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark

Abstract

Unified multimodal models aim to support both visual understanding and visual generation, yet current benchmarks rarely examine how well the two are actually integrated. Existing evaluations either treat the two abilities in isolation or overlook tasks that inherently couple them. To address this gap, we present Uni-MMMU, a comprehensive, discipline-aware benchmark that systematically probes the bidirectional synergy between generation and understanding across eight reasoning-centric domains, including science, coding, mathematics, and puzzles. Each task is bidirectionally coupled, requiring models to (i) leverage conceptual understanding to guide precise visual synthesis, or (ii) use generation as a cognitive scaffold for analytical reasoning. Uni-MMMU incorporates verifiable intermediate reasoning steps, unique ground truths, and a reproducible scoring protocol for both textual and visual outputs. Through extensive evaluation of state-of-the-art unified, generation-only, and understanding-only models, we reveal substantial performance disparities and cross-modal dependencies, offer new insights into when and how these abilities reinforce one another, and establish a reliable foundation for advancing unified models.
