MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs
May 27, 2025
Authors: Jiakang Yuan, Tianshuo Peng, Yilei Jiang, Yiting Lu, Renrui Zhang, Kaituo Feng, Chaoyou Fu, Tao Chen, Lei Bai, Bo Zhang, Xiangyu Yue
cs.AI
Abstract
Logical reasoning is a fundamental aspect of human intelligence and an
essential capability for multimodal large language models (MLLMs). Despite the
significant advancement in multimodal reasoning, existing benchmarks fail to
comprehensively evaluate their reasoning abilities due to the lack of explicit
categorization for logical reasoning types and an unclear understanding of
reasoning. To address these issues, we introduce MME-Reasoning, a comprehensive
benchmark designed to evaluate the reasoning ability of MLLMs, which covers all
three types of reasoning (i.e., inductive, deductive, and abductive) in its
questions. We carefully curate the data to ensure that each question
effectively evaluates reasoning ability rather than perceptual skills or
knowledge breadth, and extend the evaluation protocols to cover the evaluation
of diverse questions. Our evaluation reveals substantial limitations of
state-of-the-art MLLMs when subjected to holistic assessments of logical
reasoning capabilities. Even the most advanced MLLMs show limited performance
in comprehensive logical reasoning, with notable performance imbalances across
reasoning types. In addition, we conduct an in-depth analysis of approaches
such as "thinking mode" and rule-based RL, which are commonly believed to
enhance reasoning abilities. These findings highlight the critical limitations
and performance imbalances of current MLLMs in diverse logical reasoning
scenarios, providing comprehensive and systematic insights into the
understanding and evaluation of reasoning capabilities.