SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models
October 9, 2025
Authors: Andong Deng, Taojiannan Yang, Shoubin Yu, Lincoln Spencer, Mohit Bansal, Chen Chen, Serena Yeung-Levy, Xiaohan Wang
cs.AI
Abstract
Large Multimodal Models (LMMs) have achieved remarkable progress across
various capabilities; however, complex video reasoning in the scientific domain
remains a significant and challenging frontier. Current video benchmarks
predominantly target general scenarios that rely heavily on perception and
recognition, paired with relatively simple reasoning tasks, leading to
performance saturation and thus failing to effectively evaluate advanced
multimodal cognitive skills. To address this critical gap, we introduce
SciVideoBench, a rigorous benchmark specifically designed to assess advanced
video reasoning in scientific contexts. SciVideoBench consists of 1,000
carefully crafted multiple-choice questions derived from cutting-edge
scientific experimental videos spanning over 25 specialized academic subjects,
each verified by a semi-automatic system. Every question demands sophisticated
domain-specific knowledge, precise spatiotemporal perception, and intricate
logical reasoning, effectively challenging models' higher-order cognitive
abilities. Our evaluation reveals significant performance deficits in
state-of-the-art proprietary and open-source LMMs, including Gemini 2.5 Pro
and Qwen2.5-VL, indicating substantial room for advancement in video reasoning
capabilities. Detailed analyses of critical factors such as reasoning
complexity and visual grounding provide valuable insights and a clear
direction for the future development of LMMs, driving the evolution of truly
capable multimodal AI co-scientists. We hope SciVideoBench will resonate with
the community and help push the boundaries of cutting-edge AI for broader
science.