CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs

Abstract

Vision-Language Models (VLMs) have demonstrated remarkable progress in multimodal understanding, yet their capabilities for scientific reasoning remain inadequately assessed. Current multimodal benchmarks predominantly evaluate generic image comprehension or text-driven reasoning, and lack authentic scientific contexts that require integrating domain-specific knowledge with the analysis of visual evidence. To fill this gap, we present CSVQA, a diagnostic multimodal benchmark specifically designed for evaluating scientific reasoning through domain-grounded visual question answering. Our benchmark features 1,378 carefully constructed question-answer pairs spanning diverse STEM disciplines, each demanding domain knowledge, integration of visual evidence, and higher-order reasoning. Compared to prior multimodal benchmarks, CSVQA places greater emphasis on real-world scientific content and complex reasoning. We additionally propose a rigorous evaluation protocol to systematically assess whether model predictions are substantiated by valid intermediate reasoning steps based on curated explanations. Our comprehensive evaluation of 15 VLMs on this benchmark reveals notable performance disparities: even the top-ranked proprietary model attains only 49.6% accuracy. This empirical evidence underscores the pressing need to advance scientific reasoning capabilities in VLMs. CSVQA is released at https://huggingface.co/datasets/Skywork/CSVQA.

