GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts?

Abstract

Vision language models (VLMs) achieve unified modeling of images and text, enabling them to accomplish complex real-world tasks through perception, planning, and reasoning. Among these capabilities, reasoning is particularly representative, with mathematical reasoning serving as a prominent example: it tests the high-level ability of VLMs to comprehend mathematical information in images and to perform sophisticated reasoning over it. Numerous visual mathematical reasoning benchmarks have been proposed recently, but they are often restricted to geometry, lack coverage of math word problems, and rarely assess reasoning across multiple images. To address these gaps, we introduce GSM8K-V, a purely visual multi-image mathematical reasoning benchmark. GSM8K-V is built by systematically mapping each sample from the widely used text-based GSM8K into visual form. Through a carefully designed automated image-generation pipeline combined with meticulous human annotation, we curate 1,319 high-quality samples. We evaluate a wide range of open-source and closed-source models on GSM8K-V. Results show that although existing VLMs have nearly saturated performance on text-based GSM8K, there remains substantial room for improvement on GSM8K-V. For example, the best-performing model, Gemini-2.5-Pro, achieves 95.22% accuracy on GSM8K but only 46.93% on GSM8K-V. We conduct a comprehensive analysis of GSM8K-V, examining the limitations of current models as well as potential directions for improvement. GSM8K-V offers a new perspective on visual mathematical reasoning and establishes a benchmark to guide the development of more robust and generalizable VLMs.