

CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs

May 30, 2025
作者: Ai Jian, Weijie Qiu, Xiaokun Wang, Peiyu Wang, Yunzhuo Hao, Jiangbo Pei, Yichen Wei, Yi Peng, Xuchen Song
cs.AI

Abstract

Vision-Language Models (VLMs) have demonstrated remarkable progress in multimodal understanding, yet their capabilities for scientific reasoning remain inadequately assessed. Current multimodal benchmarks predominantly evaluate generic image comprehension or text-driven reasoning, lacking authentic scientific contexts that require integrating domain-specific knowledge with visual evidence analysis. To fill this gap, we present CSVQA, a diagnostic multimodal benchmark specifically designed to evaluate scientific reasoning through domain-grounded visual question answering. Our benchmark features 1,378 carefully constructed question-answer pairs spanning diverse STEM disciplines, each demanding domain knowledge, integration of visual evidence, and higher-order reasoning. Compared to prior multimodal benchmarks, CSVQA places greater emphasis on real-world scientific content and complex reasoning. We additionally propose a rigorous evaluation protocol to systematically assess whether model predictions are substantiated by valid intermediate reasoning steps grounded in curated explanations. Our comprehensive evaluation of 15 VLMs on this benchmark reveals notable performance disparities: even the top-ranked proprietary model attains only 49.6% accuracy. This empirical evidence underscores the pressing need to advance the scientific reasoning capabilities of VLMs. CSVQA is released at https://huggingface.co/datasets/Skywork/CSVQA.

