MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
September 4, 2024
Authors: Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Ming Yin, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, Graham Neubig
cs.AI
Abstract
This paper introduces MMMU-Pro, a robust version of the Massive
Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark.
MMMU-Pro rigorously assesses multimodal models' true understanding and
reasoning capabilities through a three-step process based on MMMU: (1)
filtering out questions answerable by text-only models, (2) augmenting
candidate options, and (3) introducing a vision-only input setting where
questions are embedded within images. This setting challenges AI to truly "see"
and "read" simultaneously, testing a fundamental human cognitive skill of
seamlessly integrating visual and textual information. Results show that model
performance is substantially lower on MMMU-Pro than on MMMU, with drops ranging
from 16.8% to 26.9% across models. We explore the impact of OCR prompts and Chain of
Thought (CoT) reasoning, finding that OCR prompts have minimal effect while CoT
generally improves performance. MMMU-Pro provides a more rigorous evaluation
tool, closely mimicking real-world scenarios and offering valuable directions
for future research in multimodal AI.
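To make the benchmark construction concrete, below is a minimal sketch of steps (1) and (3) in Python. It is an illustration only, not the authors' released code: `ask_text_only_model` is a hypothetical stand-in for any text-only LLM call, the 0.5 filtering threshold is an assumed parameter, and the paper's vision-only inputs are screenshots or photos of questions, for which a plain text rendering stands in here.

```python
import textwrap
from PIL import Image, ImageDraw, ImageFont

def ask_text_only_model(model: str, question: str, options: list[str]) -> str:
    """Hypothetical stand-in: return the option letter a text-only model
    picks when given only the question text (no image)."""
    raise NotImplementedError  # plug in any LLM client here

def filter_text_answerable(dataset, text_models, threshold=0.5):
    """Step (1): drop questions that text-only models can already answer,
    so every retained item genuinely requires the image."""
    kept = []
    for item in dataset:
        answers = [ask_text_only_model(m, item["question"], item["options"])
                   for m in text_models]
        accuracy = sum(a == item["answer"] for a in answers) / len(answers)
        if accuracy < threshold:  # most text-only models fail -> keep it
            kept.append(item)
    return kept

def render_vision_only(item, width=800, margin=20, line_height=16):
    """Step (3): embed the question and options in a single image, forcing
    the model to read the text visually instead of receiving it as tokens."""
    lines = textwrap.wrap(item["question"], width=70)
    for letter, option in zip("ABCDEFGHIJ", item["options"]):
        lines += textwrap.wrap(f"({letter}) {option}", width=70)
    img = Image.new("RGB", (width, 2 * margin + line_height * len(lines)), "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()
    for i, line in enumerate(lines):
        draw.text((margin, margin + i * line_height), line, fill="black", font=font)
    return img
```

Filtering on the aggregate accuracy of several text-only models, rather than a single one, reduces the chance that a question survives only because one particular model happened to miss it.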