MathReal: We Keep It Real! A Real Scene Benchmark for Evaluating Math Reasoning in Multimodal Large Language Models

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in visual mathematical reasoning across various existing benchmarks. However, these benchmarks are predominantly based on clean or processed multimodal inputs and do not include images supplied by real-world Kindergarten through 12th grade (K-12) educational users. To address this gap, we introduce MathReal, a meticulously curated dataset comprising 2,000 mathematical questions with images captured by handheld mobile devices in authentic scenarios. Each question is presented as an image containing both the question text and the visual elements. We systematically classify the real images into three primary categories (image quality degradation, perspective variation, and irrelevant content interference), which are further divided into 14 subcategories. Additionally, MathReal spans five core knowledge and ability categories, which encompass three question types and are divided into three difficulty levels. To comprehensively evaluate the multimodal mathematical reasoning abilities of state-of-the-art MLLMs in real-world scenarios, we design six experimental settings that enable a systematic analysis of their performance. Through extensive experimentation, we find that the problem-solving abilities of existing MLLMs are significantly challenged in realistic educational contexts. Based on this, we conduct a thorough analysis of their performance and error patterns, providing insights into their recognition, comprehension, and reasoning capabilities, and outlining directions for future improvements. Data and code: https://github.com/junfeng0288/MathReal.
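
As a rough illustration of how per-category analysis over the three primary real-scene categories might be organized, the following is a minimal sketch. It is not the authors' evaluation code: the Result record, its field names, the function accuracy_by_category, and the assumption that each question carries exactly one primary category label and a binary correctness judgment are all hypothetical. Only the three category names come from the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass

# The three primary real-scene categories named in the abstract.
PRIMARY_CATEGORIES = (
    "image quality degradation",
    "perspective variation",
    "irrelevant content interference",
)


@dataclass(frozen=True)
class Result:
    """One evaluated question (hypothetical record layout, not from the paper)."""
    question_id: str
    category: str   # assumed to be exactly one of PRIMARY_CATEGORIES
    correct: bool   # assumed binary judgment of the model's final answer


def accuracy_by_category(results):
    """Return {category: fraction correct} for every category that appears."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        if r.category not in PRIMARY_CATEGORIES:
            raise ValueError(f"unknown category: {r.category!r}")
        totals[r.category] += 1
        hits[r.category] += int(r.correct)
    # Only categories with at least one result are present, so no division by zero.
    return {c: hits[c] / totals[c] for c in totals}
```

The same grouping could be keyed on the 14 subcategories, the question types, or the difficulty levels instead; those label sets are not listed in the abstract, so they are left out of the sketch.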
