MathReal: We Keep It Real! A Real Scene Benchmark for Evaluating Math Reasoning in Multimodal Large Language Models
August 8, 2025
Authors: Jun Feng, Zixin Wang, Zhentao Zhang, Yue Guo, Zhihan Zhou, Xiuyi Chen, Zhenyang Li, Dawei Yin
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated remarkable
capabilities in visual mathematical reasoning across various existing
benchmarks. However, these benchmarks are predominantly based on clean or
processed multimodal inputs, without incorporating the images provided by
real-world Kindergarten through 12th grade (K-12) educational users. To address
this gap, we introduce MathReal, a meticulously curated dataset comprising
2,000 mathematical questions with images captured by handheld mobile devices in
authentic scenarios. Each question is presented as an image containing the
question text and visual elements. We systematically classify the real images into three
primary categories: image quality degradation, perspective variation, and
irrelevant content interference, which are further delineated into 14
subcategories. Additionally, MathReal spans five core knowledge and ability
categories, includes three question types, and is divided into three
difficulty levels. To comprehensively evaluate the multimodal mathematical
reasoning abilities of state-of-the-art MLLMs in real-world scenarios, we
design six experimental settings that enable a systematic analysis of their
performance. Through extensive experimentation, we find that the
problem-solving abilities of existing MLLMs are significantly challenged in
realistic educational contexts. Based on this, we conduct a thorough analysis
of their performance and error patterns, providing insights into their
recognition, comprehension, and reasoning capabilities, and outlining
directions for future improvements. Data and code:
https://github.com/junfeng0288/MathReal.