MathReal: We Keep It Real! A Real Scene Benchmark for Evaluating Math Reasoning in Multimodal Large Language Models

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in visual mathematical reasoning across various existing benchmarks. However, these benchmarks are predominantly based on clean or processed multimodal inputs and do not include the kinds of images that real-world Kindergarten through 12th grade (K-12) educational users provide. To address this gap, we introduce MathReal, a meticulously curated dataset comprising 2,000 mathematical questions with images captured by handheld mobile devices in authentic scenarios. Each question is a single image containing both the question text and the visual elements. We systematically classify the real images into three primary categories: image quality degradation, perspective variation, and irrelevant content interference, which are further delineated into 14 subcategories. Additionally, MathReal spans five core knowledge and ability categories, which encompass three question types and are divided into three difficulty levels. To comprehensively evaluate the multimodal mathematical reasoning abilities of state-of-the-art MLLMs in real-world scenarios, we design six experimental settings that enable a systematic analysis of their performance. Through extensive experimentation, we find that the problem-solving abilities of existing MLLMs are significantly challenged in realistic educational contexts. Based on this, we conduct a thorough analysis of their performance and error patterns, providing insights into their recognition, comprehension, and reasoning capabilities, and outlining directions for future improvements. Data and code: https://github.com/junfeng0288/MathReal.