SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification
June 18, 2025
Authors: Chengye Wang, Yifei Shen, Zexi Kuang, Arman Cohan, Yilun Zhao
cs.AI
Abstract
We introduce SciVer, the first benchmark specifically designed to evaluate
the ability of foundation models to verify claims within a multimodal
scientific context. SciVer consists of 3,000 expert-annotated examples over
1,113 scientific papers, covering four subsets, each representing a common
reasoning type in multimodal scientific claim verification. To enable
fine-grained evaluation, each example includes expert-annotated supporting
evidence. We assess the performance of 21 state-of-the-art multimodal
foundation models, including o4-mini, Gemini-2.5-Flash, Llama-3.2-Vision, and
Qwen2.5-VL. Our experiments reveal a substantial performance gap between these
models and human experts on SciVer. Through an in-depth analysis of
retrieval-augmented generation (RAG) and human-conducted error evaluations, we
identify critical limitations in current open-source models, offering key
insights to advance models' comprehension and reasoning in multimodal
scientific literature tasks.
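
To make the verification task concrete, the sketch below shows how a single claim-plus-figure example could be scored with an OpenAI-compatible vision-language model. The example fields, prompt wording, and the two-way label set ("entailed" / "refuted") are illustrative assumptions for this sketch, not the official SciVer data schema or evaluation harness.

```python
# Hypothetical sketch of multimodal claim verification: send a paper excerpt,
# a figure image, and a claim to a vision-language model and parse its verdict.
# The field names and label space below are assumptions for illustration only.
import base64
from openai import OpenAI

LABELS = {"entailed", "refuted"}  # assumed two-way label space


def verify_claim(client: OpenAI, model: str, claim: str,
                 context: str, image_path: str) -> str:
    """Return the model's verdict on whether `claim` is supported by the
    paper excerpt and the accompanying figure."""
    # Encode the figure so it can be passed inline as a data URL.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    prompt = (
        "You are verifying a scientific claim against a paper excerpt and a figure.\n"
        f"Paper excerpt:\n{context}\n\n"
        f"Claim: {claim}\n"
        "Answer with exactly one word: entailed or refuted."
    )

    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        temperature=0,  # deterministic decoding for evaluation
    )

    answer = response.choices[0].message.content.strip().lower()
    # Fall back to a fixed label if the model's output is unparsable.
    return answer if answer in LABELS else "refuted"
```

Accuracy over a set of such examples would then be the fraction of predicted labels matching the expert annotations; the benchmark itself additionally provides expert-annotated supporting evidence for finer-grained analysis.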