"PhyWorldBench": A Comprehensive Evaluation of Physical Realism in Text-to-Video Models

July 17, 2025
Authors: Jing Gu, Xian Liu, Yu Zeng, Ashwin Nagarajan, Fangrui Zhu, Daniel Hong, Yue Fan, Qianqi Yan, Kaiwen Zhou, Ming-Yu Liu, Xin Eric Wang
cs.AI

Abstract

Video generation models have achieved remarkable progress in creating high-quality, photorealistic content. However, their ability to accurately simulate physical phenomena remains a critical and unresolved challenge. This paper presents PhyWorldBench, a comprehensive benchmark designed to evaluate video generation models based on their adherence to the laws of physics. The benchmark covers multiple levels of physical phenomena, ranging from fundamental principles like object motion and energy conservation to more complex scenarios involving rigid body interactions and human or animal motion. Additionally, we introduce a novel "Anti-Physics" category, where prompts intentionally violate real-world physics, enabling the assessment of whether models can follow such instructions while maintaining logical consistency. Besides large-scale human evaluation, we also design a simple yet effective method that uses current multimodal large language models (MLLMs) to evaluate physical realism in a zero-shot fashion. We evaluate 12 state-of-the-art text-to-video generation models, including five open-source and five proprietary models, with a detailed comparison and analysis. Through systematic testing of their outputs across 1,050 curated prompts (spanning fundamental, composite, and anti-physics scenarios), we identify pivotal challenges these models face in adhering to real-world physics. We then rigorously examine their performance on diverse physical phenomena with varying prompt types, deriving targeted recommendations for crafting prompts that enhance fidelity to physical principles.