SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models
October 9, 2025
Authors: Andong Deng, Taojiannan Yang, Shoubin Yu, Lincoln Spencer, Mohit Bansal, Chen Chen, Serena Yeung-Levy, Xiaohan Wang
cs.AI
Abstract
Large Multimodal Models (LMMs) have achieved remarkable progress across
various capabilities; however, complex video reasoning in the scientific domain
remains a significant and challenging frontier. Current video benchmarks
predominantly target general scenarios that rely heavily on perception and
recognition while posing relatively simple reasoning tasks, leading to
performance saturation and thus failing to effectively evaluate advanced
multimodal cognitive skills.
To address this critical gap, we introduce SciVideoBench, a rigorous benchmark
specifically designed to assess advanced video reasoning in scientific
contexts. SciVideoBench consists of 1,000 carefully crafted multiple-choice
questions derived from cutting-edge scientific experimental videos spanning
over 25 specialized academic subjects and verified by a semi-automatic system.
Each question demands sophisticated domain-specific knowledge, precise
spatiotemporal perception, and intricate logical reasoning, effectively
challenging models' higher-order cognitive abilities. Our evaluation highlights
significant performance deficits in state-of-the-art proprietary and
open-source LMMs, including Gemini 2.5 Pro and Qwen2.5-VL, indicating
substantial room for advancement in video reasoning capabilities. Detailed
analyses of critical factors such as reasoning complexity and visual grounding
provide valuable insights and clear direction for future developments in LMMs,
driving the evolution of truly capable multimodal AI co-scientists. We hope
SciVideoBench will resonate with the interests of the community and help push
the boundaries of cutting-edge AI in broader scientific domains.
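
To make the benchmark setup concrete, the sketch below illustrates what a SciVideoBench-style multiple-choice record and a per-subject accuracy evaluation could look like. The field names (`video_id`, `subject`, `options`, `answer`) and the evaluation loop are assumptions for illustration only; the abstract does not specify the paper's actual data schema or evaluation harness.

```python
import json
from collections import defaultdict

# Hypothetical example record; the actual SciVideoBench schema is not
# described in the abstract.
EXAMPLE_ITEM = {
    "video_id": "exp_0042",
    "subject": "Molecular Biology",  # one of the specialized academic subjects
    "question": "What causes the color change observed in the reaction vessel?",
    "options": ["A. pH shift", "B. Temperature rise",
                "C. Enzymatic cleavage", "D. Photobleaching"],
    "answer": "C",
}


def evaluate(predictions, items):
    """Compute overall and per-subject multiple-choice accuracy.

    predictions: dict mapping question/video id -> predicted option letter
    items: list of benchmark records shaped like EXAMPLE_ITEM
    """
    per_subject = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
    for item in items:
        qid = item["video_id"]
        correct = predictions.get(qid, "").strip().upper() == item["answer"]
        per_subject[item["subject"]][0] += int(correct)
        per_subject[item["subject"]][1] += 1
    total = sum(t for _, t in per_subject.values())
    overall = sum(c for c, _ in per_subject.values()) / max(total, 1)
    return overall, {s: c / t for s, (c, t) in per_subject.items()}


if __name__ == "__main__":
    preds = {"exp_0042": "C"}  # predictions from some LMM under evaluation
    overall, by_subject = evaluate(preds, [EXAMPLE_ITEM])
    print(json.dumps({"overall": overall, "by_subject": by_subject}, indent=2))
```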