Benchmarking Scientific Understanding and Reasoning for Video Generation using VideoScience-Bench
December 2, 2025
Authors: Lanxiang Hu, Abhilash Shankarampeta, Yixin Huang, Zilin Dai, Haoyang Yu, Yujie Zhao, Haoqiang Kang, Daniel Zhao, Tajana Rosing, Hao Zhang
cs.AI
Abstract
The next frontier for video generation lies in developing models capable of zero-shot reasoning, where understanding real-world scientific laws is crucial for accurately modeling physical outcomes under diverse conditions. However, existing video benchmarks are grounded in physical commonsense and offer limited insight into video models' scientific reasoning capabilities. We introduce VideoScience-Bench, a benchmark designed to evaluate undergraduate-level scientific understanding in video models. Each prompt encodes a composite scientific scenario that requires understanding and reasoning across multiple scientific concepts to generate the correct phenomenon. The benchmark comprises 200 carefully curated prompts spanning 14 topics and 103 concepts in physics and chemistry. We conduct expert-annotated evaluations of seven state-of-the-art video models in both text-to-video (T2V) and image-to-video (I2V) settings along five dimensions: Prompt Consistency, Phenomenon Congruency, Correct Dynamism, Immutability, and Spatio-Temporal Continuity. Using a VLM-as-a-Judge to assess the generated videos, we observe strong correlation with human assessments. To the best of our knowledge, VideoScience-Bench is the first benchmark to evaluate video models not only as generators but also as reasoners, requiring their generations to demonstrate scientific understanding consistent with expected physical and chemical phenomena. Our data and evaluation code are available at: https://github.com/hao-ai-lab/VideoScience.