
LikePhys: Evaluating Intuitive Physics Understanding in Video Diffusion Models via Likelihood Preference

October 13, 2025
Authors: Jianhao Yuan, Fabio Pizzati, Francesco Pinto, Lars Kunze, Ivan Laptev, Paul Newman, Philip Torr, Daniele De Martini
cs.AI

Abstract

Intuitive physics understanding in video diffusion models plays an essential role in building general-purpose, physically plausible world simulators, yet accurately evaluating this capacity remains challenging because physical correctness is difficult to disentangle from visual appearance in generation. To this end, we introduce LikePhys, a training-free method that evaluates intuitive physics in video diffusion models by distinguishing physically valid from physically impossible videos, using the denoising objective as an ELBO-based likelihood surrogate on a curated dataset of valid-invalid pairs. By testing on our constructed benchmark of twelve scenarios spanning four physics domains, we show that our evaluation metric, Plausibility Preference Error (PPE), aligns strongly with human preference and outperforms state-of-the-art evaluator baselines. We then systematically benchmark intuitive physics understanding in current video diffusion models. Our study further analyses how model design and inference settings affect intuitive physics understanding and highlights domain-specific capacity variations across physical laws. Empirical results show that, although current models struggle with complex and chaotic dynamics, physics understanding improves clearly as model capacity and inference settings scale.
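
The idea of using the denoising objective as a likelihood surrogate on valid-invalid pairs can be illustrated with a minimal sketch. This is not the authors' implementation, and the abstract does not give the exact definition of PPE. The sketch assumes PPE is the fraction of pairs in which the model fails to assign a strictly lower denoising loss to the valid video. All names here (`denoising_loss`, `plausibility_preference_error`, the toy denoisers, the noise-level range) are hypothetical, and a "video" is reduced to a 1D position trajectory so the example runs with NumPy alone.

```python
import numpy as np

def denoising_loss(denoiser, video, seed, n_samples=32):
    """Monte Carlo estimate of the noise-prediction error for one video.

    A lower value is read as a higher (ELBO-style) likelihood under the model.
    """
    rng = np.random.default_rng(seed)
    losses = []
    for _ in range(n_samples):
        s = rng.uniform(0.05, 0.95)                 # assumed noise-level range
        eps = rng.standard_normal(video.shape)
        noisy = np.sqrt(1.0 - s) * video + np.sqrt(s) * eps
        losses.append(np.mean((denoiser(noisy, s) - eps) ** 2))
    return float(np.mean(losses))

def plausibility_preference_error(denoiser, pairs, seed=0):
    """Assumed PPE: share of (valid, invalid) pairs where the valid video
    does not get a strictly lower denoising loss. Ties count as errors."""
    errors = 0
    for valid, invalid in pairs:
        # The same seed gives both videos identical noise levels and noise draws.
        loss_valid = denoising_loss(denoiser, valid, seed)
        loss_invalid = denoising_loss(denoiser, invalid, seed)
        errors += int(loss_invalid <= loss_valid)
    return errors / len(pairs)

def constant_velocity_denoiser(noisy, s):
    """Toy stand-in for a diffusion model with a 'constant velocity' prior:
    it fits a straight line over time and treats the residual as noise."""
    frames = noisy.shape[0]
    design = np.stack([np.ones(frames), np.arange(frames)], axis=1)
    coef, *_ = np.linalg.lstsq(design, noisy, rcond=None)
    clean_scaled = design @ coef                    # estimate of sqrt(1 - s) * video
    return (noisy - clean_scaled) / np.sqrt(s)

def blind_denoiser(noisy, s):
    """Toy model with no physics prior: always predicts zero noise."""
    return np.zeros_like(noisy)

def make_pair(frames, velocity, jump):
    """Valid: uniform motion. Invalid: the same motion with a mid-clip teleport."""
    valid = (velocity * np.arange(frames, dtype=float))[:, None]
    invalid = valid.copy()
    invalid[frames // 2:] += jump
    return valid, invalid

pairs = [make_pair(16, 0.1, 1.0), make_pair(16, 0.3, 0.5), make_pair(16, -0.2, 2.0)]
ppe_prior = plausibility_preference_error(constant_velocity_denoiser, pairs)
ppe_blind = plausibility_preference_error(blind_denoiser, pairs)
print("PPE with constant-velocity prior:", ppe_prior)
print("PPE with no physics prior:", ppe_blind)
```

In this toy setup the constant-velocity denoiser always prefers the uniform-motion trajectory, while the blind denoiser produces identical losses for both videos and is counted as wrong on every pair. Sharing the noise draws within a pair is a design choice made here to reduce variance in the comparison; whether LikePhys does the same is not stated in the abstract.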