MMR-Life: Piecing Together Real-life Scenes for Multimodal Multi-image Reasoning

Abstract

Recent progress in the reasoning capabilities of multimodal large language models (MLLMs) has enabled them to address more complex tasks such as scientific analysis and mathematical reasoning. Despite this promise, the reasoning abilities of MLLMs across different real-life scenarios remain largely unexplored, and no standardized benchmark exists for evaluating them. To address this gap, we introduce MMR-Life, a comprehensive benchmark designed to evaluate the diverse multimodal multi-image reasoning capabilities of MLLMs across real-life scenarios. MMR-Life consists of 2,646 multiple-choice questions based on 19,108 images primarily sourced from real-world contexts, covering seven reasoning types: abductive, analogical, causal, deductive, inductive, spatial, and temporal. Unlike existing reasoning benchmarks, MMR-Life does not rely on domain-specific expertise; instead, it requires models to integrate information across multiple images and apply diverse reasoning abilities. Our evaluation of 37 advanced models highlights the substantial challenge posed by MMR-Life: even top models such as GPT-5 achieve only 58% accuracy, and their performance varies considerably across reasoning types. Moreover, we analyze the reasoning paradigms of existing MLLMs, exploring how factors such as thinking length, reasoning method, and reasoning type affect their performance. In summary, MMR-Life establishes a comprehensive foundation for evaluating, analyzing, and improving the next generation of multimodal reasoning systems.
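
As an illustration of how accuracy on a multiple-choice benchmark of this kind might be broken down by reasoning type, the following is a minimal sketch. It is not the authors' evaluation code: the record fields `type`, `answer`, and `prediction` are hypothetical names, and the toy records are invented for the example and are not benchmark results.

```python
from collections import defaultdict

# The seven reasoning types named in the abstract.
REASONING_TYPES = (
    "abductive", "analogical", "causal", "deductive",
    "inductive", "spatial", "temporal",
)


def accuracy_by_type(records):
    """Return (per-type accuracy dict, overall accuracy).

    Each record is a dict with hypothetical keys:
      'type'       - one of REASONING_TYPES
      'answer'     - the gold option label, e.g. 'A'
      'prediction' - the option label chosen by the model
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["type"]] += 1
        if record["prediction"] == record["answer"]:
            correct[record["type"]] += 1
    # Only types that actually occur in the records are reported.
    per_type = {t: correct[t] / total[t] for t in total}
    n_total = sum(total.values())
    overall = sum(correct.values()) / n_total if n_total else 0.0
    return per_type, overall


# Toy data, invented purely for illustration.
toy_records = [
    {"type": "causal", "answer": "A", "prediction": "A"},
    {"type": "causal", "answer": "B", "prediction": "C"},
    {"type": "spatial", "answer": "D", "prediction": "D"},
    {"type": "temporal", "answer": "B", "prediction": "A"},
]

per_type, overall = accuracy_by_type(toy_records)
for reasoning_type, acc in sorted(per_type.items()):
    print(f"{reasoning_type}: {acc:.2f}")
print(f"overall: {overall:.2f}")
```

Reporting the per-type breakdown alongside the overall score makes the kind of variance across reasoning types described above visible, since a single aggregate number would hide it.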