MathReal: We Keep It Real! A Real Scene Benchmark for Evaluating Math Reasoning in Multimodal Large Language Models

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in visual mathematical reasoning across various existing benchmarks. However, these benchmarks are predominantly based on clean or processed multimodal inputs and do not include images provided by real-world Kindergarten through 12th grade (K-12) educational users. To address this gap, we introduce MathReal, a meticulously curated dataset comprising 2,000 mathematical questions with images captured by handheld mobile devices in authentic scenarios. Each question is presented as an image containing the question text and visual elements. We systematically classify the real images into three primary categories: image quality degradation, perspective variation, and irrelevant content interference, which are further divided into 14 subcategories. Additionally, MathReal spans five core knowledge and ability categories, includes three question types, and is divided into three difficulty levels. To comprehensively evaluate the multimodal mathematical reasoning abilities of state-of-the-art MLLMs in real-world scenarios, we design six experimental settings that enable a systematic analysis of their performance. Through extensive experiments, we find that the problem-solving abilities of existing MLLMs are significantly challenged in realistic educational contexts. Based on this, we conduct a thorough analysis of their performance and error patterns, providing insights into their recognition, comprehension, and reasoning capabilities, and outlining directions for future improvements. Data and code: https://github.com/junfeng0288/MathReal.
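
As a rough illustration of how the three primary real-image categories could be used to break down model performance, the following minimal Python sketch groups per-question results by category. It is not taken from the MathReal repository: the `Record` class, its field names, the integer encoding of difficulty, and the `accuracy_by_category` helper are all assumptions made for this example. Only the three category names come from the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class RealImageCategory(Enum):
    # The three primary categories named in the abstract.
    IMAGE_QUALITY_DEGRADATION = "image quality degradation"
    PERSPECTIVE_VARIATION = "perspective variation"
    IRRELEVANT_CONTENT_INTERFERENCE = "irrelevant content interference"


@dataclass(frozen=True)
class Record:
    # Hypothetical per-question record; field names are illustrative only.
    image_path: str
    category: RealImageCategory
    difficulty: int  # 1..3, assuming the three difficulty levels are ordered
    correct: bool    # whether a model's answer was judged correct


def accuracy_by_category(records):
    """Return {category: fraction of correct answers} for the given records."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        totals[r.category] += 1
        hits[r.category] += int(r.correct)
    # Categories with no records are simply absent from the result.
    return {c: hits[c] / totals[c] for c in totals}
```

The same grouping could be keyed on difficulty level or question type instead. The actual dataset schema and evaluation protocol are defined in the linked repository and may differ from this sketch.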