Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation
October 7, 2024
Authors: Fanqing Meng, Jiaqi Liao, Xinyu Tan, Wenqi Shao, Quanfeng Lu, Kaipeng Zhang, Yu Cheng, Dianqi Li, Yu Qiao, Ping Luo
cs.AI
Abstract
Text-to-video (T2V) models like Sora have made significant strides in
visualizing complex prompts, which is increasingly viewed as a promising path
towards constructing a universal world simulator. Cognitive psychologists
believe that the foundation for achieving this goal is the ability to
understand intuitive physics. However, the capacity of these models to
accurately represent intuitive physics remains largely unexplored. To bridge
this gap, we introduce PhyGenBench, a comprehensive Physics
Generation Benchmark designed to evaluate physical
commonsense correctness in T2V generation. PhyGenBench comprises 160 carefully
crafted prompts across 27 distinct physical laws, spanning four fundamental
domains, enabling a comprehensive assessment of models' understanding of physical
commonsense. Alongside PhyGenBench, we propose a novel evaluation framework
called PhyGenEval. This framework employs a hierarchical evaluation structure,
drawing on vision-language models and large language models appropriate to each
stage to assess physical commonsense. Through PhyGenBench and PhyGenEval, we can
conduct large-scale automated assessments of T2V models' understanding of
physical commonsense, which align closely with human feedback. Our evaluation
results and in-depth analysis demonstrate that current models struggle to
generate videos that comply with physical commonsense. Moreover, simply scaling
up models or employing prompt engineering techniques is insufficient to fully
address the challenges presented by PhyGenBench (e.g., dynamic scenarios). We
hope this study will inspire the community to prioritize the learning of
physical commonsense in these models beyond entertainment applications. We will
release the data and code at https://github.com/OpenGVLab/PhyGenBench.
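The abstract only says that PhyGenEval applies a hierarchical structure combining vision-language models and large language models. As a rough illustration of what such a pipeline could look like, here is a minimal Python sketch. The data schema (`BenchmarkPrompt`) and the two judge callables (`describe_video`, `judge_description`) are assumptions for illustration, not the released interface; see the repository above for the actual data and evaluation code.

```python
# A minimal sketch, assuming a plausible shape for PhyGenBench prompt entries
# and a PhyGenEval-style two-stage scorer. All names here are illustrative
# assumptions, not the released API.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BenchmarkPrompt:
    """One of the 160 curated prompts (fields are hypothetical)."""
    text: str          # e.g. "A glass of water is tipped over on a table."
    physical_law: str  # one of the 27 physical laws, e.g. "gravity"
    domain: str        # one of the four fundamental domains, e.g. "mechanics"


def phygen_style_score(
    frame_paths: List[str],
    prompt: BenchmarkPrompt,
    describe_video: Callable[[List[str], str], str],
    judge_description: Callable[[str, str], float],
) -> float:
    """Hierarchical evaluation in the spirit of PhyGenEval.

    Stage 1 (vision-language model): produce a grounded description of what
    physically happens in the generated video, conditioned on the prompt.
    Stage 2 (large language model): judge whether that description is
    consistent with the targeted physical law, returning a score in [0, 1].
    """
    description = describe_video(frame_paths, prompt.text)
    return judge_description(description, prompt.physical_law)
```

Splitting the judgment this way keeps each model in its comfort zone: the vision-language model only has to report what the video shows, while the language model reasons about whether that account obeys the stated physical law, which is one way large-scale automated scoring can be made to track human feedback.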