SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification

Abstract

We introduce SciVer, the first benchmark specifically designed to evaluate the ability of foundation models to verify claims within a multimodal scientific context. SciVer consists of 3,000 expert-annotated examples drawn from 1,113 scientific papers and covers four subsets, each representing a common reasoning type in multimodal scientific claim verification. To enable fine-grained evaluation, each example includes expert-annotated supporting evidence. We assess the performance of 21 state-of-the-art multimodal foundation models, including o4-mini, Gemini-2.5-Flash, Llama-3.2-Vision, and Qwen2.5-VL. Our experiments reveal a substantial performance gap between these models and human experts on SciVer. Through an in-depth analysis of retrieval-augmented generation (RAG) and human-conducted error evaluations, we identify critical limitations in current open-source models, offering key insights to advance models' comprehension and reasoning in multimodal scientific literature tasks.
