
Benchmarking Scientific Understanding and Reasoning for Video Generation using VideoScience-Bench

December 2, 2025
Authors: Lanxiang Hu, Abhilash Shankarampeta, Yixin Huang, Zilin Dai, Haoyang Yu, Yujie Zhao, Haoqiang Kang, Daniel Zhao, Tajana Rosing, Hao Zhang
cs.AI

Abstract

The next frontier for video generation lies in developing models capable of zero-shot reasoning, where understanding real-world scientific laws is crucial for accurately modeling physical outcomes under diverse conditions. However, existing video benchmarks are grounded in physical commonsense and offer limited insight into video models' scientific reasoning capability. We introduce VideoScience-Bench, a benchmark designed to evaluate undergraduate-level scientific understanding in video models. Each prompt encodes a composite scientific scenario that requires understanding and reasoning across multiple scientific concepts to generate the correct phenomenon. The benchmark comprises 200 carefully curated prompts spanning 14 topics and 103 concepts in physics and chemistry. We conduct expert-annotated evaluations of seven state-of-the-art video models in text-to-video (T2V) and image-to-video (I2V) settings along five dimensions: Prompt Consistency, Phenomenon Congruency, Correct Dynamism, Immutability, and Spatio-Temporal Continuity. Using a VLM-as-a-Judge to assess video generations, we observe strong correlation with human assessments. To the best of our knowledge, VideoScience-Bench is the first benchmark to evaluate video models not only as generators but also as reasoners, requiring their generations to demonstrate scientific understanding consistent with expected physical and chemical phenomena. Our data and evaluation code are available at: https://github.com/hao-ai-lab/VideoScience.
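To make the five-dimension VLM-as-a-Judge protocol concrete, below is a minimal, hypothetical sketch of how one generated video might be scored against the rubric named in the abstract. The function names (build_judge_prompt, score_video, call_vlm_judge) and the 0-to-1 score scale are illustrative assumptions, not the released evaluation code; the actual implementation is in the linked repository.

```python
# Hypothetical sketch: scoring one generated video along the five
# VideoScience-Bench dimensions with an injected VLM judge.
from dataclasses import dataclass
from typing import Callable, Dict, List

DIMENSIONS = [
    "Prompt Consistency",          # does the video follow the scientific prompt?
    "Phenomenon Congruency",       # does the expected phenomenon actually occur?
    "Correct Dynamism",            # are the dynamics physically plausible?
    "Immutability",                # do object properties stay constant when they should?
    "Spatio-Temporal Continuity",  # no flicker, teleporting, or temporal breaks
]

@dataclass
class JudgeResult:
    scenario: str
    scores: Dict[str, float]  # dimension name -> score in [0, 1]

def build_judge_prompt(scenario: str, dimension: str) -> str:
    """Compose the text instruction handed to the VLM judge for one dimension."""
    return (
        f"You are grading a generated video for the scientific scenario:\n{scenario}\n"
        f"Rate the video on '{dimension}' with a score between 0 and 1, "
        "where 1 means the video fully satisfies this criterion."
    )

def score_video(
    scenario: str,
    video_frames: List,
    call_vlm_judge: Callable[[List, str], float],
) -> JudgeResult:
    """Score one video on all five dimensions.

    `call_vlm_judge(frames, instruction) -> float` is assumed to wrap whatever
    vision-language model serves as the judge.
    """
    scores = {
        dim: call_vlm_judge(video_frames, build_judge_prompt(scenario, dim))
        for dim in DIMENSIONS
    }
    return JudgeResult(scenario=scenario, scores=scores)

if __name__ == "__main__":
    # Dummy judge so the sketch runs end-to-end without a real VLM.
    mock_judge = lambda frames, instruction: 0.5
    result = score_video(
        "An ice cube is dropped into a beaker of hot concentrated acid.",
        video_frames=[],
        call_vlm_judge=mock_judge,
    )
    print(result.scores)
```

Per-dimension scores produced this way can then be averaged per model and compared against the expert annotations, which is how the reported correlation between the VLM judge and human assessments would be measured.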