GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts
September 29, 2025
Authors: Fan Yuan, Yuchen Yan, Yifan Jiang, Haoran Zhao, Tao Feng, Jinyan Chen, Yanwei Lou, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang
cs.AI
Abstract
Vision language models (VLMs) achieve unified modeling of images and text,
enabling them to accomplish complex real-world tasks through perception,
planning, and reasoning. Among these tasks, reasoning is particularly
representative, with mathematical reasoning as a prominent example: it
highlights the high-level capability of VLMs to comprehend mathematical
information in images and to perform sophisticated reasoning. Recently,
numerous visual mathematical reasoning benchmarks have been proposed, but they
are often restricted to geometry, lack coverage of math word problems, and
rarely assess reasoning across multiple images. To address these gaps, we
introduce GSM8K-V, a purely visual multi-image mathematical reasoning
benchmark. GSM8K-V is built by systematically mapping each sample from the
widely used text-based GSM8K into visual form. Through a carefully designed
automated image-generation pipeline combined with meticulous human annotation,
we curate 1,319 high-quality samples. We evaluate a wide range of open-source
and closed-source models on GSM8K-V. Results show that although existing VLMs
have nearly saturated performance on text-based GSM8K, there remains
substantial room for improvement on GSM8K-V. For example, the best-performing
model, Gemini-2.5-Pro, achieves 95.22% accuracy on GSM8K but only 46.93% on
GSM8K-V. We conduct a comprehensive analysis of GSM8K-V, examining the
limitations of current models as well as potential directions for improvement.
GSM8K-V offers a new perspective on visual mathematical reasoning and
establishes a benchmark to guide the development of more robust and
generalizable VLMs.
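The accuracy figures above (e.g. 95.22% on GSM8K vs. 46.93% on GSM8K-V) follow the standard exact-match convention for math word problems: a prediction counts as correct when its final numeric answer matches the gold answer. A minimal sketch of such a scorer is shown below; the function names, answer-extraction heuristic, and demo data are illustrative assumptions, not the authors' actual evaluation code.

```python
import re

# Hypothetical exact-match scorer for GSM8K-style answers; the
# last-number heuristic is an assumption, not the paper's method.

def extract_final_answer(response: str) -> str:
    """Take the last number in a model response as its final answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    return numbers[-1] if numbers else ""

def accuracy(samples) -> float:
    """samples: list of (model_response, gold_answer) pairs."""
    correct = sum(
        1 for response, gold in samples
        if extract_final_answer(response) == str(gold)
    )
    return 100.0 * correct / len(samples)

# Toy demo data (invented for illustration only).
demo = [
    ("The farmer has 3 + 4 = 7 apples. Answer: 7", "7"),
    ("Total cost is 12 * 5 = 60 dollars, so 60", "60"),
    ("She needs 15 - 6 = 9 more tickets", "8"),  # incorrect prediction
]
print(f"{accuracy(demo):.2f}%")  # → 66.67%
```

Under this metric, a GSM8K-V sample is scored the same way as its text-based GSM8K counterpart, which is what makes the accuracy gap between the two benchmarks directly comparable.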