MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

Abstract

This paper introduces MMMU-Pro, a more robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark. MMMU-Pro rigorously assesses multimodal models' true understanding and reasoning capabilities through a three-step process built on MMMU: (1) filtering out questions answerable by text-only models, (2) augmenting candidate options, and (3) introducing a vision-only input setting in which questions are embedded within images. This setting challenges AI to "see" and "read" simultaneously, testing the fundamental human cognitive skill of seamlessly integrating visual and textual information. Results show that model performance is substantially lower on MMMU-Pro than on MMMU, with drops ranging from 16.8% to 26.9% across models. We also explore the impact of OCR prompts and Chain of Thought (CoT) reasoning, finding that OCR prompts have minimal effect while CoT generally improves performance. MMMU-Pro provides a more rigorous evaluation tool that closely mimics real-world scenarios and offers valuable directions for future research in multimodal AI.
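As an illustration of step (1), the text-only filter could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say which text-only models were used or how "answerable" was decided, so the model interface, the dictionary fields, and the `min_models_correct` threshold are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical interface: a text-only model receives the question text and the
# candidate options (no image) and returns the index of the option it picks.
TextOnlyModel = Callable[[str, List[str]], int]


def is_text_answerable(
    question: Dict,
    models: List[TextOnlyModel],
    min_models_correct: int,
) -> bool:
    """Return True if at least `min_models_correct` text-only models pick the
    correct option without seeing the image (assumed criterion)."""
    correct = sum(
        1
        for model in models
        if model(question["text"], question["options"]) == question["answer"]
    )
    return correct >= min_models_correct


def filter_questions(
    questions: List[Dict],
    models: List[TextOnlyModel],
    min_models_correct: int,
) -> List[Dict]:
    """Keep only the questions that the text-only models fail to answer."""
    return [
        q
        for q in questions
        if not is_text_answerable(q, models, min_models_correct)
    ]
```

Under these assumptions, a question survives the filter only when fewer than `min_models_correct` text-only models get it right, so the remaining questions are those that plausibly require the image.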
