GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts
September 29, 2025
Authors: Fan Yuan, Yuchen Yan, Yifan Jiang, Haoran Zhao, Tao Feng, Jinyan Chen, Yanwei Lou, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang
cs.AI
Abstract
Vision language models (VLMs) achieve unified modeling of images and text,
enabling them to accomplish complex real-world tasks through perception,
planning, and reasoning. Among these tasks, reasoning is particularly
representative, with mathematical reasoning serving as a prominent example. It
highlights the high-level capability of VLMs to comprehend mathematical
information in images and to perform sophisticated reasoning. Recently,
numerous visual mathematical reasoning benchmarks have been proposed, but they
are often restricted to geometry, lack coverage of math word problems, and
rarely assess reasoning across multiple images. To address these gaps, we
introduce GSM8K-V, a purely visual multi-image mathematical reasoning
benchmark. GSM8K-V is built by systematically mapping each sample from the
widely used text-based GSM8K into visual form. Through a carefully designed
automated image-generation pipeline combined with meticulous human annotation,
we curate 1,319 high-quality samples. We evaluate a wide range of open-source
and closed-source models on GSM8K-V. Results show that although existing VLMs
have nearly saturated performance on text-based GSM8K, there remains
substantial room for improvement on GSM8K-V. For example, the best-performing
model, Gemini-2.5-Pro, achieves 95.22% accuracy on GSM8K but only 46.93% on
GSM8K-V. We conduct a comprehensive analysis of GSM8K-V, examining the
limitations of current models as well as potential directions for improvement.
GSM8K-V offers a new perspective on visual mathematical reasoning and
establishes a benchmark to guide the development of more robust and
generalizable VLMs.
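The headline comparison above (95.22% on text-based GSM8K vs. 46.93% on GSM8K-V) can be illustrated with a minimal sketch of the paired evaluation. All function and variable names below are hypothetical, not from the paper's codebase; the sketch only shows how an accuracy gap between the text and visual settings would be computed.

```python
# Hypothetical sketch: score the same model on text-based and image-based
# versions of the problems, then report the drop in accuracy.

def accuracy(predictions, answers):
    """Fraction of predictions that exactly match the gold answers."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

def accuracy_gap(text_preds, visual_preds, answers):
    """Accuracy drop when the same problems are posed as images."""
    return accuracy(text_preds, answers) - accuracy(visual_preds, answers)

# Toy example: 4 problems; the model solves all 4 in text form
# but only 2 when the same problems are rendered as images.
gold = ["18", "3", "70000", "540"]
text_preds = ["18", "3", "70000", "540"]
visual_preds = ["18", "3", "12000", "500"]
print(accuracy_gap(text_preds, visual_preds, gold))  # prints 0.5
```

In the paper's reported numbers, this gap for Gemini-2.5-Pro would be 0.9522 - 0.4693 ≈ 0.48, i.e., nearly half of the problems the model solves in text form are missed in visual form.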