
MMMR: Benchmarking Massive Multi-Modal Reasoning Tasks

May 22, 2025
Authors: Guiyao Tie, Xueyang Zhou, Tianhe Gu, Ruihang Zhang, Chaoran Hu, Sizhe Zhang, Mengqu Sun, Yan Zhang, Pan Zhou, Lichao Sun
cs.AI

Abstract

Recent advances in Multi-Modal Large Language Models (MLLMs) have enabled unified processing of language, vision, and structured inputs, opening the door to complex tasks such as logical deduction, spatial reasoning, and scientific analysis. Despite their promise, the reasoning capabilities of MLLMs, particularly those augmented with intermediate thinking traces (MLLMs-T), remain poorly understood and lack standardized evaluation benchmarks. Existing work focuses primarily on perception or final answer correctness, offering limited insight into how models reason or fail across modalities. To address this gap, we introduce the MMMR, a new benchmark designed to rigorously evaluate multi-modal reasoning with explicit thinking. The MMMR comprises 1) a high-difficulty dataset of 1,083 questions spanning six diverse reasoning types with symbolic depth and multi-hop demands and 2) a modular Reasoning Trace Evaluation Pipeline (RTEP) for assessing reasoning quality beyond accuracy through metrics like relevance, consistency, and structured error annotations. Empirical results show that MLLMs-T overall outperform non-thinking counterparts, but even top models like Claude-3.7-Sonnet and Gemini-2.5 Pro suffer from reasoning pathologies such as inconsistency and overthinking. This benchmark reveals persistent gaps between accuracy and reasoning quality and provides an actionable evaluation pipeline for future model development. Overall, the MMMR offers a scalable foundation for evaluating, comparing, and improving the next generation of multi-modal reasoning systems.
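To make the kind of trace-level evaluation described above concrete, below is a minimal sketch of how relevance and consistency scores could be aggregated alongside answer accuracy over reasoning traces. All names (`TraceExample`, `relevance_score`, `consistency_score`, `evaluate`) and the word-overlap and substring heuristics are illustrative assumptions, not the paper's actual RTEP metrics or implementation.

```python
from dataclasses import dataclass


@dataclass
class TraceExample:
    """One benchmark item: question, a model's reasoning trace, and its answer.

    Hypothetical structure for illustration; the MMMR data format is not specified here.
    """
    question: str
    trace_steps: list[str]
    final_answer: str
    gold_answer: str


def relevance_score(example: TraceExample) -> float:
    """Fraction of trace steps sharing at least one content word with the question.

    A crude proxy for trace relevance, used only to illustrate the idea of
    scoring reasoning quality separately from final-answer accuracy.
    """
    question_words = {w.lower() for w in example.question.split() if len(w) > 3}
    if not example.trace_steps:
        return 0.0
    hits = sum(
        1 for step in example.trace_steps
        if question_words & {w.lower() for w in step.split()}
    )
    return hits / len(example.trace_steps)


def consistency_score(example: TraceExample) -> float:
    """1.0 if the final answer literally appears in some trace step, else 0.0."""
    answer = example.final_answer.lower()
    return float(any(answer in step.lower() for step in example.trace_steps))


def evaluate(examples: list[TraceExample]) -> dict[str, float]:
    """Aggregate accuracy plus trace-quality scores over a set of examples."""
    n = len(examples)
    return {
        "accuracy": sum(e.final_answer == e.gold_answer for e in examples) / n,
        "relevance": sum(relevance_score(e) for e in examples) / n,
        "consistency": sum(consistency_score(e) for e in examples) / n,
    }


if __name__ == "__main__":
    demo = TraceExample(
        question="How many squares are fully shaded in the diagram?",
        trace_steps=[
            "Count the shaded squares row by row in the diagram.",
            "Row 1 has 2 shaded squares; row 2 has 1, so the total is 3.",
        ],
        final_answer="3",
        gold_answer="3",
    )
    print(evaluate([demo]))
```

Separating these scores from accuracy is the point the abstract makes: a model can reach the right answer through an irrelevant or internally inconsistent trace, and such cases only surface when the trace itself is scored.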

