Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency

April 24, 2025
作者: Zhikai Wang, Jiashuo Sun, Wenqi Zhang, Zhiqiang Hu, Xin Li, Fan Wang, Deli Zhao
cs.AI

Abstract

Recent advancements in Large Vision-Language Models (LVLMs) have significantly enhanced their ability to integrate visual and linguistic information, achieving near-human proficiency in tasks such as object recognition, captioning, and visual question answering. However, current benchmarks typically focus on knowledge-centric evaluations that assess domain-specific expertise, often neglecting the core ability to reason about fundamental mathematical elements and visual concepts. We identify a gap in evaluating elementary-level math problems that rely on explicit visual dependencies, requiring models to discern, integrate, and reason across multiple images while incorporating commonsense knowledge; these abilities are crucial for advancing toward broader AGI capabilities. To address this gap, we introduce VCBENCH, a comprehensive benchmark for multimodal mathematical reasoning with explicit visual dependencies. VCBENCH includes 1,720 problems spanning six cognitive domains and featuring 6,697 images (an average of 3.9 per question) to ensure multi-image reasoning. We evaluate 26 state-of-the-art LVLMs on VCBENCH, revealing substantial performance disparities, with even the top models unable to exceed 50% accuracy. Our findings highlight the ongoing challenges in visual-mathematical integration and suggest avenues for future LVLM advancements.