

Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation

October 7, 2024
Authors: Fanqing Meng, Jiaqi Liao, Xinyu Tan, Wenqi Shao, Quanfeng Lu, Kaipeng Zhang, Yu Cheng, Dianqi Li, Yu Qiao, Ping Luo
cs.AI

Abstract

Text-to-video (T2V) models like Sora have made significant strides in visualizing complex prompts, which is increasingly viewed as a promising path towards constructing a universal world simulator. Cognitive psychologists believe that the foundation for achieving this goal is the ability to understand intuitive physics. However, the capacity of these models to accurately represent intuitive physics remains largely unexplored. To bridge this gap, we introduce PhyGenBench, a comprehensive Physics Generation Benchmark designed to evaluate physical commonsense correctness in T2V generation. PhyGenBench comprises 160 carefully crafted prompts across 27 distinct physical laws, spanning four fundamental domains, enabling a comprehensive assessment of models' understanding of physical commonsense. Alongside PhyGenBench, we propose a novel evaluation framework called PhyGenEval. This framework employs a hierarchical evaluation structure, utilizing appropriate advanced vision-language models and large language models to assess physical commonsense. Through PhyGenBench and PhyGenEval, we can conduct large-scale automated assessments of T2V models' understanding of physical commonsense, and these assessments align closely with human feedback. Our evaluation results and in-depth analysis demonstrate that current models struggle to generate videos that comply with physical commonsense. Moreover, simply scaling up models or employing prompt engineering techniques is insufficient to fully address the challenges presented by PhyGenBench (e.g., dynamic scenarios). We hope this study will inspire the community to prioritize the learning of physical commonsense in these models beyond entertainment applications. We will release the data and code at https://github.com/OpenGVLab/PhyGenBench.
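
The abstract does not detail PhyGenEval's interface, so the following is a minimal, hypothetical sketch of how a hierarchical, stage-wise scoring loop over PhyGenBench-style prompts could be structured. Every name here (Prompt, Stage, semantic_stage, physics_stage, score_video) is an illustrative placeholder, not the released API; in the actual framework, each stage would query a vision-language model or LLM with physics-targeted questions, whereas the stubs below only illustrate the control flow.

```python
# Hypothetical sketch of a hierarchical evaluation loop in the spirit of
# PhyGenEval. None of these names come from the PhyGenBench repository.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Prompt:
    text: str          # the T2V prompt, e.g. "a glass of water tips over"
    physical_law: str  # one of the 27 physical laws, e.g. "gravity"
    domain: str        # one of the four fundamental domains, e.g. "mechanics"

# Each stage maps (video, prompt) -> a score in [0, 1]. In PhyGenEval the
# stages are backed by vision-language models and LLMs; here they are stubs.
Stage = Callable[[str, Prompt], float]

def semantic_stage(video_path: str, prompt: Prompt) -> float:
    # Placeholder: a VLM would check whether the video depicts prompt.text.
    return 1.0

def physics_stage(video_path: str, prompt: Prompt) -> float:
    # Placeholder: targeted questions about prompt.physical_law would go here.
    return 0.0

def score_video(video_path: str, prompt: Prompt, stages: List[Stage]) -> float:
    """Run the stages coarse-to-fine and average their per-stage scores."""
    scores = [stage(video_path, prompt) for stage in stages]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    p = Prompt("a glass of water tips over", "gravity", "mechanics")
    print(score_video("sample.mp4", p, [semantic_stage, physics_stage]))
```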