SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification

Abstract

We introduce SciVer, the first benchmark specifically designed to evaluate the ability of foundation models to verify claims within a multimodal scientific context. SciVer consists of 3,000 expert-annotated examples drawn from 1,113 scientific papers and is organized into four subsets, each representing a common reasoning type in multimodal scientific claim verification. To enable fine-grained evaluation, each example includes expert-annotated supporting evidence. We assess the performance of 21 state-of-the-art multimodal foundation models, including o4-mini, Gemini-2.5-Flash, Llama-3.2-Vision, and Qwen2.5-VL. Our experiments reveal a substantial performance gap between these models and human experts on SciVer. Through an in-depth analysis of retrieval-augmented generation (RAG) and human-conducted error evaluations, we identify critical limitations in current open-source models and offer key insights for advancing models' comprehension and reasoning in multimodal scientific literature tasks.
