SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models
October 9, 2025
Authors: Andong Deng, Taojiannan Yang, Shoubin Yu, Lincoln Spencer, Mohit Bansal, Chen Chen, Serena Yeung-Levy, Xiaohan Wang
cs.AI
Abstract
Large Multimodal Models (LMMs) have achieved remarkable progress across
various capabilities; however, complex video reasoning in the scientific domain
remains a significant and challenging frontier. Current video benchmarks
predominantly target general scenarios that rely heavily on perception and
recognition, paired with relatively simple reasoning tasks, leading to
performance saturation and thus failing to effectively evaluate advanced
multimodal cognitive skills. To address this critical gap, we introduce
SciVideoBench, a rigorous benchmark specifically designed to assess advanced
video reasoning in scientific contexts. SciVideoBench consists of 1,000
carefully crafted multiple-choice questions derived from cutting-edge
scientific experimental videos spanning over 25 specialized academic subjects,
each verified by a semi-automatic system. Every question demands sophisticated
domain-specific knowledge, precise spatiotemporal perception, and intricate
logical reasoning, effectively challenging models' higher-order cognitive
abilities. Our evaluation reveals significant performance deficits in
state-of-the-art proprietary and open-source LMMs, including Gemini 2.5 Pro
and Qwen2.5-VL, indicating substantial room for advancement in video reasoning
capabilities. Detailed analyses of critical factors such as reasoning
complexity and visual grounding provide valuable insights and a clear
direction for the future development of LMMs, driving the evolution of truly
capable multimodal AI co-scientists. We hope SciVideoBench will resonate with
the community and help push the boundaries of cutting-edge AI for broader
science.