SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models

Abstract

Large Multimodal Models (LMMs) have achieved remarkable progress across a range of capabilities; however, complex video reasoning in the scientific domain remains a significant and challenging frontier. Current video benchmarks predominantly target general scenarios that rely heavily on perception and recognition and involve only relatively simple reasoning tasks. As a result, they have saturated and fail to effectively evaluate advanced multimodal cognitive skills. To address this gap, we introduce SciVideoBench, a rigorous benchmark specifically designed to assess advanced video reasoning in scientific contexts. SciVideoBench consists of 1,000 carefully crafted multiple-choice questions derived from cutting-edge scientific experimental videos spanning over 25 specialized academic subjects and verified by a semi-automatic system. Each question demands sophisticated domain-specific knowledge, precise spatiotemporal perception, and intricate logical reasoning, effectively challenging models' higher-order cognitive abilities. Our evaluation reveals significant performance deficits in state-of-the-art proprietary and open-source LMMs, including Gemini 2.5 Pro and Qwen2.5-VL, indicating substantial room for advancement in video reasoning capabilities. Detailed analyses of critical factors such as reasoning complexity and visual grounding provide valuable insights and clear direction for future developments in LMMs, driving the evolution of truly capable multimodal AI co-scientists. We hope SciVideoBench will serve the interests of the community and help push the boundary of cutting-edge AI for broader science.
