MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs

Abstract

Logical reasoning is a fundamental aspect of human intelligence and an essential capability for multimodal large language models (MLLMs). Despite significant advances in multimodal reasoning, existing benchmarks fail to evaluate the reasoning abilities of MLLMs comprehensively, because they lack an explicit categorization of logical reasoning types and rest on an unclear understanding of reasoning. To address these issues, we introduce MME-Reasoning, a comprehensive benchmark designed to evaluate the reasoning ability of MLLMs, whose questions cover all three types of reasoning: inductive, deductive, and abductive. We carefully curate the data so that each question evaluates reasoning ability rather than perceptual skills or breadth of knowledge, and we extend the evaluation protocols to cover a diverse range of questions. Our evaluation reveals substantial limitations of state-of-the-art MLLMs when they are subjected to a holistic assessment of logical reasoning capabilities. Even the most advanced MLLMs show limited performance in comprehensive logical reasoning, with notable performance imbalances across reasoning types. In addition, we conduct an in-depth analysis of approaches that are commonly believed to enhance reasoning abilities, such as "thinking mode" and rule-based RL. These findings highlight the critical limitations and performance imbalances of current MLLMs in diverse logical reasoning scenarios, and they provide comprehensive and systematic insights into the understanding and evaluation of reasoning capabilities.
