GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts

Abstract

Vision language models (VLMs) model images and text jointly, which enables them to accomplish complex real-world tasks through perception, planning, and reasoning. Among these tasks, reasoning is particularly representative, and mathematical reasoning is a prominent example: it highlights the high-level capability of VLMs to comprehend mathematical information in images and to perform sophisticated reasoning. Numerous visual mathematical reasoning benchmarks have been proposed recently, but they are often restricted to geometry, lack coverage of math word problems, and rarely assess reasoning across multiple images. To address these gaps, we introduce GSM8K-V, a purely visual multi-image mathematical reasoning benchmark. GSM8K-V is built by systematically mapping each sample from the widely used text-based GSM8K into visual form. Through a carefully designed automated image-generation pipeline combined with meticulous human annotation, we curate 1,319 high-quality samples. We evaluate a wide range of open-source and closed-source models on GSM8K-V. Results show that although existing VLMs have nearly saturated performance on text-based GSM8K, there remains substantial room for improvement on GSM8K-V. For example, the best-performing model, Gemini-2.5-Pro, achieves 95.22% accuracy on GSM8K but only 46.93% on GSM8K-V. We conduct a comprehensive analysis of GSM8K-V, examining the limitations of current models as well as potential directions for improvement. GSM8K-V offers a new perspective on visual mathematical reasoning and establishes a benchmark to guide the development of more robust and generalizable VLMs.
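
To make the reported gap concrete, the short sketch below converts the two quoted accuracies into a percentage-point difference and into approximate counts of correctly solved problems. It assumes that both figures are computed over the same 1,319 samples, which the abstract implies but does not state outright; the function name `implied_correct` is illustrative and does not come from the benchmark's code.

```python
# Minimal sketch: arithmetic on the figures quoted in the abstract.
# Assumption: both accuracies are measured over the same 1,319 samples.

TOTAL_SAMPLES = 1319  # benchmark size stated in the abstract


def implied_correct(accuracy_pct: float, total: int = TOTAL_SAMPLES) -> int:
    """Nearest whole number of correct answers implied by a percentage."""
    return round(accuracy_pct / 100 * total)


text_acc = 95.22    # Gemini-2.5-Pro on text-based GSM8K (from the abstract)
visual_acc = 46.93  # Gemini-2.5-Pro on GSM8K-V (from the abstract)

# Difference in percentage points between the text and visual settings.
gap_points = round(text_acc - visual_acc, 2)

text_correct = implied_correct(text_acc)
visual_correct = implied_correct(visual_acc)

print(f"gap: {gap_points} percentage points")
print(f"implied correct answers: {text_correct} (text) vs {visual_correct} (visual)")
```

Under that assumption, the best-performing model solves roughly half as many problems when they are presented as images as when they are presented as text.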
