"PhyWorldBench":文本到视频模型物理真实性的综合评估
"PhyWorldBench": A Comprehensive Evaluation of Physical Realism in Text-to-Video Models
July 17, 2025
Authors: Jing Gu, Xian Liu, Yu Zeng, Ashwin Nagarajan, Fangrui Zhu, Daniel Hong, Yue Fan, Qianqi Yan, Kaiwen Zhou, Ming-Yu Liu, Xin Eric Wang
cs.AI
Abstract
Video generation models have achieved remarkable progress in creating
high-quality, photorealistic content. However, their ability to accurately
simulate physical phenomena remains a critical and unresolved challenge. This
paper presents PhyWorldBench, a comprehensive benchmark designed to evaluate
video generation models based on their adherence to the laws of physics. The
benchmark covers multiple levels of physical phenomena, ranging from
fundamental principles like object motion and energy conservation to more
complex scenarios involving rigid body interactions and human or animal motion.
Additionally, we introduce a novel "Anti-Physics" category, where prompts
intentionally violate real-world physics, enabling the assessment of whether
models can follow such instructions while maintaining logical consistency.
In addition to large-scale human evaluation, we design a simple yet effective
method that uses current MLLMs to evaluate physical realism in a
zero-shot fashion. We evaluate 12 state-of-the-art text-to-video generation
models, including five open-source and five proprietary models, with a detailed
comparison and analysis. Through systematic testing of their outputs across
1,050 curated prompts, spanning fundamental, composite, and anti-physics
scenarios, we identify pivotal challenges these models face in adhering to real-world
physics. We then rigorously examine their performance on diverse physical
phenomena with varying prompt types, deriving targeted recommendations for
crafting prompts that enhance fidelity to physical principles.
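The zero-shot MLLM evaluation the abstract mentions can be thought of as posing a physics-judgment question about generated frames and mapping the model's free-form answer to a score. The sketch below is an illustrative reconstruction, not the paper's actual pipeline: `query_mllm` is a hypothetical stand-in for any MLLM API that accepts video frames plus a text question, and the yes/no scoring scheme is an assumption for illustration.

```python
# Hedged sketch of zero-shot physical-realism judging with an MLLM.
# `query_mllm` is a hypothetical callable (frames, prompt) -> answer string;
# plug in any real multimodal model API in its place.

def build_judge_prompt(event: str) -> str:
    """Frame physics checking as a binary question for the MLLM."""
    return (
        f"Watch the video frames. Does the event '{event}' obey real-world "
        "physics (gravity, momentum, object permanence)? Answer 'yes' or 'no'."
    )

def parse_verdict(answer: str) -> int:
    """Map the MLLM's free-form answer to a 0/1 physics-realism score."""
    return 1 if answer.strip().lower().startswith("yes") else 0

def score_video(frames, event: str, query_mllm) -> int:
    """Score one generated video for physical realism in zero-shot fashion."""
    return parse_verdict(query_mllm(frames, build_judge_prompt(event)))

# Usage with a stubbed MLLM that always answers "Yes."
stub_mllm = lambda frames, prompt: "Yes."
print(score_video([], "a ball falls under gravity", stub_mllm))  # → 1
```

Averaging such binary verdicts over the 1,050 prompts would yield a per-model physics-realism rate comparable to human evaluation scores.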