MMMR: Benchmarking Massive Multi-Modal Reasoning Tasks

Abstract

Recent advances in Multi-Modal Large Language Models (MLLMs) have enabled unified processing of language, vision, and structured inputs, opening the door to complex tasks such as logical deduction, spatial reasoning, and scientific analysis. Despite this promise, the reasoning capabilities of MLLMs, particularly those augmented with intermediate thinking traces (MLLMs-T), remain poorly understood, and no standardized benchmark exists for evaluating them. Existing work focuses primarily on perception or final-answer correctness, offering limited insight into how models reason or fail across modalities. To address this gap, we introduce MMMR, a new benchmark designed to rigorously evaluate multi-modal reasoning with explicit thinking. MMMR comprises (1) a high-difficulty dataset of 1,083 questions spanning six diverse reasoning types with symbolic depth and multi-hop demands and (2) a modular Reasoning Trace Evaluation Pipeline (RTEP) for assessing reasoning quality beyond accuracy through metrics such as relevance, consistency, and structured error annotations. Empirical results show that MLLMs-T generally outperform their non-thinking counterparts, but even top models such as Claude-3.7-Sonnet and Gemini-2.5 Pro suffer from reasoning pathologies such as inconsistency and overthinking. The benchmark reveals persistent gaps between accuracy and reasoning quality and provides an actionable evaluation pipeline for future model development. Overall, MMMR offers a scalable foundation for evaluating, comparing, and improving the next generation of multi-modal reasoning systems.