Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark
October 15, 2025
Authors: Kai Zou, Ziqi Huang, Yuhao Dong, Shulin Tian, Dian Zheng, Hongbo Liu, Jingwen He, Bin Liu, Yu Qiao, Ziwei Liu
cs.AI
Abstract
Unified multimodal models aim to jointly enable visual understanding and
generation, yet current benchmarks rarely examine their true integration.
Existing evaluations either treat the two abilities in isolation or overlook
tasks that inherently couple them. To address this gap, we present Uni-MMMU, a
comprehensive and discipline-aware benchmark that systematically unfolds the
bidirectional synergy between generation and understanding across eight
reasoning-centric domains, including science, coding, mathematics, and puzzles.
Each task is bidirectionally coupled, requiring models to (i) leverage
conceptual understanding to guide precise visual synthesis, or (ii) utilize
generation as a cognitive scaffold for analytical reasoning. Uni-MMMU
incorporates verifiable intermediate reasoning steps, unique ground truths, and
a reproducible scoring protocol for both textual and visual outputs. Through
extensive evaluation of state-of-the-art unified, generation-only, and
understanding-only models, we reveal substantial performance disparities and
cross-modal dependencies, offering new insights into when and how these
abilities reinforce one another, and establishing a reliable foundation for
advancing unified models.
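
As a rough illustration of the kind of reproducible, exact-match scoring against a unique ground truth that the abstract describes, here is a minimal Python sketch. It is not the authors' released evaluation code; the names `Prediction`, `GroundTruth`, and `score`, and the assumption that visual outputs are first reduced to discrete labels by an automatic checker, are hypothetical.

```python
# Minimal sketch of an exact-match scoring protocol (hypothetical, not the paper's code).
from dataclasses import dataclass


@dataclass
class Prediction:
    answer_text: str          # model's final textual answer
    image_labels: list[str]   # discrete labels extracted from the generated image
                              # by some automatic checker (assumption for this sketch)


@dataclass
class GroundTruth:
    answer_text: str
    image_labels: list[str]


def normalize(s: str) -> str:
    """Lowercase and collapse whitespace so equivalent answers compare equal."""
    return " ".join(s.lower().split())


def score(pred: Prediction, gt: GroundTruth) -> dict[str, float]:
    """Exact-match scoring of textual and visual outputs against the unique ground truth."""
    text_acc = float(normalize(pred.answer_text) == normalize(gt.answer_text))
    image_acc = float(pred.image_labels == gt.image_labels)
    return {"text_acc": text_acc, "image_acc": image_acc}


if __name__ == "__main__":
    pred = Prediction(answer_text="B", image_labels=["step1:rotate", "step2:flip"])
    gt = GroundTruth(answer_text="b", image_labels=["step1:rotate", "step2:flip"])
    print(score(pred, gt))  # {'text_acc': 1.0, 'image_acc': 1.0}
```

Because both modalities are reduced to deterministic comparisons against a single ground truth, repeated runs of such a protocol yield identical scores, which is what makes it reproducible.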