MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs

Abstract

Logical reasoning is a fundamental aspect of human intelligence and an essential capability for multimodal large language models (MLLMs). Despite significant advances in multimodal reasoning, existing benchmarks fail to comprehensively evaluate the reasoning abilities of MLLMs because they lack an explicit categorization of logical reasoning types and rest on an unclear understanding of reasoning. To address these issues, we introduce MME-Reasoning, a comprehensive benchmark designed to evaluate the reasoning ability of MLLMs, whose questions cover all three types of reasoning: inductive, deductive, and abductive. We carefully curate the data to ensure that each question evaluates reasoning ability rather than perceptual skills or knowledge breadth, and we extend the evaluation protocols to handle diverse questions. Our evaluation reveals substantial limitations of state-of-the-art MLLMs when they are subjected to a holistic assessment of logical reasoning capabilities. Even the most advanced MLLMs show limited performance in comprehensive logical reasoning, with notable performance imbalances across reasoning types. In addition, we conduct an in-depth analysis of approaches that are commonly believed to enhance reasoning abilities, such as "thinking mode" and rule-based RL. These findings highlight the critical limitations and performance imbalances of current MLLMs in diverse logical reasoning scenarios and provide comprehensive and systematic insights into the understanding and evaluation of reasoning capabilities.
