SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification
June 18, 2025
Authors: Chengye Wang, Yifei Shen, Zexi Kuang, Arman Cohan, Yilun Zhao
cs.AI
Abstract
We introduce SciVer, the first benchmark specifically designed to evaluate
the ability of foundation models to verify claims within a multimodal
scientific context. SciVer consists of 3,000 expert-annotated examples over
1,113 scientific papers, covering four subsets, each representing a common
reasoning type in multimodal scientific claim verification. To enable
fine-grained evaluation, each example includes expert-annotated supporting
evidence. We assess the performance of 21 state-of-the-art multimodal
foundation models, including o4-mini, Gemini-2.5-Flash, Llama-3.2-Vision, and
Qwen2.5-VL. Our experiments reveal a substantial performance gap between these
models and human experts on SciVer. Through an in-depth analysis of
retrieval-augmented generation (RAG) and human-conducted error evaluations, we
identify critical limitations in current open-source models, offering key
insights to advance models' comprehension and reasoning in multimodal
scientific literature tasks.
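
To make the evaluation setup concrete, the sketch below scores a single SciVer-style example against a multimodal model. It is a minimal illustration, not the paper's harness: the record fields (claim, context, image_path, label), the two-way SUPPORTED/REFUTED label set, and the use of o4-mini through the OpenAI chat-completions API are assumptions, since the abstract does not specify SciVer's data schema, label space, or prompting protocol.

# Minimal sketch of scoring one SciVer-style example (illustrative only).
# Record fields and the label set are assumptions, not the benchmark's schema.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def verify_claim(claim: str, context: str, image_path: str) -> str:
    """Ask a multimodal model whether the claim is supported or refuted."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="o4-mini",  # one of the evaluated models; any multimodal model could be swapped in
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": (f"Paper context:\n{context}\n\n"
                          f"Claim: {claim}\n"
                          "Answer with exactly one word: SUPPORTED or REFUTED.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip().upper()

# Hypothetical usage on one annotated example record `ex`:
# pred = verify_claim(ex["claim"], ex["context"], ex["image_path"])
# correct = (pred == ex["label"])

Benchmark accuracy would then be the fraction of examples for which the predicted verdict matches the expert-annotated label, with the annotated supporting evidence enabling finer-grained analysis of where models fail.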