MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
September 4, 2024
Authors: Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Ming Yin, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, Graham Neubig
cs.AI
Abstract
This paper introduces MMMU-Pro, a robust version of the Massive
Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark.
MMMU-Pro rigorously assesses multimodal models' true understanding and
reasoning capabilities through a three-step process based on MMMU: (1)
filtering out questions answerable by text-only models, (2) augmenting
candidate options, and (3) introducing a vision-only input setting where
questions are embedded within images. This setting challenges AI to truly "see"
and "read" simultaneously, testing a fundamental human cognitive skill of
seamlessly integrating visual and textual information. Results show that model
performance is substantially lower on MMMU-Pro than on MMMU, ranging from 16.8%
to 26.9% across models. We explore the impact of OCR prompts and Chain of
Thought (CoT) reasoning, finding that OCR prompts have minimal effect while CoT
generally improves performance. MMMU-Pro provides a more rigorous evaluation
tool, closely mimicking real-world scenarios and offering valuable directions
for future research in multimodal AI.
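
To make the three-step construction concrete, below is a minimal Python sketch of step (1), filtering out questions that text-only models can answer without the image. The `models` callables, the answer-key field names, and the 0.5 cutoff are illustrative assumptions, not the paper's exact protocol:

```python
from typing import Callable

def filter_text_answerable(
    questions: list[dict],
    models: list[Callable[[str, list[str]], str]],
    max_correct_fraction: float = 0.5,  # assumed cutoff, not from the paper
) -> list[dict]:
    """Keep only questions that most text-only models get wrong.

    Each entry in `questions` is assumed to carry "question", "options",
    and "answer" fields; each model maps (question, options) -> answer.
    """
    kept = []
    for q in questions:
        correct = sum(
            model(q["question"], q["options"]) == q["answer"]
            for model in models
        )
        if correct / len(models) < max_correct_fraction:
            kept.append(q)  # likely requires the image to answer
    return kept
```

Steps (2) and (3) follow the same spirit: option augmentation enlarges each question's candidate set so that guessing and shortcut heuristics pay off less, and the vision-only setting renders the question text into the image itself so the model must read and see jointly.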