Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark
October 15, 2025
Authors: Kai Zou, Ziqi Huang, Yuhao Dong, Shulin Tian, Dian Zheng, Hongbo Liu, Jingwen He, Bin Liu, Yu Qiao, Ziwei Liu
cs.AI
Abstract
Unified multimodal models aim to jointly enable visual understanding and
generation, yet current benchmarks rarely examine their true integration.
Existing evaluations either treat the two abilities in isolation or overlook
tasks that inherently couple them. To address this gap, we present Uni-MMMU, a
comprehensive and discipline-aware benchmark that systematically unfolds the
bidirectional synergy between generation and understanding across eight
reasoning-centric domains, including science, coding, mathematics, and puzzles.
Each task is bidirectionally coupled, requiring models to (i) leverage
conceptual understanding to guide precise visual synthesis, or (ii) utilize
generation as a cognitive scaffold for analytical reasoning. Uni-MMMU
incorporates verifiable intermediate reasoning steps, unique ground truths, and
a reproducible scoring protocol for both textual and visual outputs. Through
extensive evaluation of state-of-the-art unified, generation-only, and
understanding-only models, we reveal substantial performance disparities and
cross-modal dependencies, offering new insights into when and how these
abilities reinforce one another, and establishing a reliable foundation for
advancing unified models.
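
The abstract highlights a reproducible scoring protocol with unique ground truths for both textual and visual outputs. Below is a minimal, hypothetical Python sketch of what deterministic scoring of one such task instance could look like; the `Instance` fields, the normalization step, the hash-based image check, and the equal text/image weighting are all illustrative assumptions and not the benchmark's actual implementation.

```python
# Hypothetical sketch of a reproducible scoring routine for one task instance.
# This is NOT the Uni-MMMU implementation; field names and weights are assumed.
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class Instance:
    question_id: str
    gt_answer: str               # unique textual ground truth
    gt_image_hash: str | None    # reference-image hash if the task requires generation


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so text scoring is run-to-run reproducible."""
    return " ".join(text.lower().split())


def score_text(prediction: str, inst: Instance) -> float:
    """Exact match against the unique ground truth after normalization."""
    return float(normalize(prediction) == normalize(inst.gt_answer))


def score_image(pred_image_hash: str | None, inst: Instance) -> float:
    """Placeholder visual check: compare content hashes.
    A real protocol would use a perceptual or rule-based verifier instead."""
    if inst.gt_image_hash is None:
        return 1.0  # no image output to grade for this task
    return float(pred_image_hash == inst.gt_image_hash)


def score(prediction: str, pred_image_hash: str | None, inst: Instance) -> float:
    # Equal weighting of textual and visual correctness (an assumption).
    return 0.5 * score_text(prediction, inst) + 0.5 * score_image(pred_image_hash, inst)


if __name__ == "__main__":
    inst = Instance("maze-001", "down, right, right", None)
    print(score("Down, Right, Right", None, inst))  # -> 1.0
```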