MathReal: We Keep It Real! A Real Scene Benchmark for Evaluating Math Reasoning in Multimodal Large Language Models

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in visual mathematical reasoning across various existing benchmarks. However, these benchmarks are predominantly based on clean or processed multimodal inputs and do not incorporate the images that real-world Kindergarten through 12th grade (K-12) educational users actually provide. To address this gap, we introduce MathReal, a meticulously curated dataset of 2,000 mathematical questions with images captured by handheld mobile devices in authentic scenarios. Each question is presented as an image containing both the question text and its visual elements. We systematically classify the real images into three primary categories: image quality degradation, perspective variation, and irrelevant content interference, which are further divided into 14 subcategories. Additionally, MathReal spans five core knowledge and ability categories, three question types, and three difficulty levels. To comprehensively evaluate the multimodal mathematical reasoning abilities of state-of-the-art MLLMs in real-world scenarios, we design six experimental settings that enable a systematic analysis of their performance. Through extensive experiments, we find that the problem-solving abilities of existing MLLMs are significantly challenged in realistic educational contexts. Based on this, we conduct a thorough analysis of their performance and error patterns, providing insights into their recognition, comprehension, and reasoning capabilities and outlining directions for future improvement. Data and code: https://github.com/junfeng0288/MathReal.

