SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models

Abstract

Large Multimodal Models (LMMs) have achieved remarkable progress across various capabilities; however, complex video reasoning in the scientific domain remains a significant and challenging frontier. Current video benchmarks predominantly target general scenarios that rely heavily on perception and recognition and involve only relatively simple reasoning tasks. As a result, they have become saturated and fail to evaluate advanced multimodal cognitive skills effectively. To address this critical gap, we introduce SciVideoBench, a rigorous benchmark specifically designed to assess advanced video reasoning in scientific contexts. SciVideoBench consists of 1,000 carefully crafted multiple-choice questions derived from cutting-edge scientific experimental videos spanning over 25 specialized academic subjects and verified by a semi-automatic system. Each question demands sophisticated domain-specific knowledge, precise spatiotemporal perception, and intricate logical reasoning, effectively challenging models' higher-order cognitive abilities. Our evaluation highlights significant performance deficits in state-of-the-art proprietary and open-source LMMs, including Gemini 2.5 Pro and Qwen2.5-VL, indicating substantial room for advancement in video reasoning capabilities. Detailed analyses of critical factors such as reasoning complexity and visual grounding provide valuable insights and clear direction for future developments in LMMs, driving the evolution of truly capable multimodal AI co-scientists. We hope SciVideoBench serves the interests of the community and helps push the boundary of cutting-edge AI toward broader science.
