MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs
May 27, 2025
Authors: Jiakang Yuan, Tianshuo Peng, Yilei Jiang, Yiting Lu, Renrui Zhang, Kaituo Feng, Chaoyou Fu, Tao Chen, Lei Bai, Bo Zhang, Xiangyu Yue
cs.AI
Abstract
Logical reasoning is a fundamental aspect of human intelligence and an
essential capability for multimodal large language models (MLLMs). Despite the
significant advancement in multimodal reasoning, existing benchmarks fail to
comprehensively evaluate their reasoning abilities due to the lack of explicit
categorization for logical reasoning types and an unclear understanding of
reasoning. To address these issues, we introduce MME-Reasoning, a comprehensive
benchmark designed to evaluate the reasoning ability of MLLMs, which covers all
three types of reasoning (i.e., inductive, deductive, and abductive) in its
questions. We carefully curate the data to ensure that each question
effectively evaluates reasoning ability rather than perceptual skills or
knowledge breadth, and extend the evaluation protocols to cover the evaluation
of diverse questions. Our evaluation reveals substantial limitations of
state-of-the-art MLLMs when subjected to holistic assessments of logical
reasoning capabilities. Even the most advanced MLLMs show limited performance
in comprehensive logical reasoning, with notable performance imbalances across
reasoning types. In addition, we conduct an in-depth analysis of approaches
such as "thinking mode" and rule-based RL, which are commonly believed to
enhance reasoning abilities. These findings highlight the critical limitations
and performance imbalances of current MLLMs in diverse logical reasoning
scenarios, providing comprehensive and systematic insights into the
understanding and evaluation of reasoning capabilities.