"PhyWorldBench":文本到视频模型物理真实性的综合评估
"PhyWorldBench": A Comprehensive Evaluation of Physical Realism in Text-to-Video Models
July 17, 2025
Authors: Jing Gu, Xian Liu, Yu Zeng, Ashwin Nagarajan, Fangrui Zhu, Daniel Hong, Yue Fan, Qianqi Yan, Kaiwen Zhou, Ming-Yu Liu, Xin Eric Wang
cs.AI
Abstract
Video generation models have achieved remarkable progress in creating
high-quality, photorealistic content. However, their ability to accurately
simulate physical phenomena remains a critical and unresolved challenge. This
paper presents PhyWorldBench, a comprehensive benchmark designed to evaluate
video generation models based on their adherence to the laws of physics. The
benchmark covers multiple levels of physical phenomena, ranging from
fundamental principles like object motion and energy conservation to more
complex scenarios involving rigid body interactions and human or animal motion.
Additionally, we introduce a novel "Anti-Physics" category, where prompts
intentionally violate real-world physics, enabling the assessment of whether
models can follow such instructions while maintaining logical consistency.
In addition to large-scale human evaluation, we design a simple yet effective
method that uses current MLLMs to evaluate physical realism in a
zero-shot fashion. We evaluate 12 state-of-the-art text-to-video generation
models, including five open-source and five proprietary models, with a detailed
comparison and analysis. Through systematic testing of their outputs across
1,050 curated prompts, spanning fundamental, composite, and anti-physics
scenarios, we identify pivotal challenges these models face in adhering to real-world
physics. We then rigorously examine their performance on diverse physical
phenomena with varying prompt types, deriving targeted recommendations for
crafting prompts that enhance fidelity to physical principles.
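The zero-shot MLLM evaluation the abstract mentions can be thought of as posing a physics-judgment question about generated frames and mapping the model's free-form answer to a score. The sketch below is an illustrative reconstruction, not the paper's actual pipeline: `query_mllm` is a hypothetical stand-in for any MLLM API that accepts video frames plus a text question, and the yes/no scoring scheme is an assumption for illustration.

```python
# Hedged sketch of zero-shot physical-realism judging with an MLLM.
# `query_mllm` is a hypothetical callable (frames, prompt) -> answer string;
# plug in any real multimodal model API in its place.

def build_judge_prompt(event: str) -> str:
    """Frame physics checking as a binary question for the MLLM."""
    return (
        f"Watch the video frames. Does the event '{event}' obey real-world "
        "physics (gravity, momentum, object permanence)? Answer 'yes' or 'no'."
    )

def parse_verdict(answer: str) -> int:
    """Map the MLLM's free-form answer to a 0/1 physics-realism score."""
    return 1 if answer.strip().lower().startswith("yes") else 0

def score_video(frames, event: str, query_mllm) -> int:
    """Score one generated video for physical realism in zero-shot fashion."""
    return parse_verdict(query_mllm(frames, build_judge_prompt(event)))

# Usage with a stubbed MLLM that always answers "Yes."
stub_mllm = lambda frames, prompt: "Yes."
print(score_video([], "a ball falls under gravity", stub_mllm))  # → 1
```

Averaging such binary verdicts over the 1,050 prompts would yield a per-model physics-realism rate comparable to human evaluation scores.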