LikePhys: Evaluating Intuitive Physics Understanding in Video Diffusion Models via Likelihood Preference
October 13, 2025
Authors: Jianhao Yuan, Fabio Pizzati, Francesco Pinto, Lars Kunze, Ivan Laptev, Paul Newman, Philip Torr, Daniele De Martini
cs.AI
Abstract
Intuitive physics understanding in video diffusion models plays an essential
role in building general-purpose physically plausible world simulators, yet
accurately evaluating such capacity remains a challenging task due to the
difficulty in disentangling physics correctness from visual appearance in
generation. To this end, we introduce LikePhys, a training-free method that
evaluates intuitive physics in video diffusion models by distinguishing
physically valid from impossible videos, using the denoising objective as an
ELBO-based likelihood surrogate on a curated dataset of valid-invalid pairs. By
testing on our constructed benchmark of twelve scenarios spanning four
physics domains, we show that our evaluation metric, Plausibility Preference
Error (PPE), demonstrates strong alignment with human preference, outperforming
state-of-the-art evaluator baselines. We then systematically benchmark
intuitive physics understanding in current video diffusion models. Our study
further analyses how model design and inference settings affect intuitive
physics understanding and highlights domain-specific capacity variations across
physical laws. Empirical results show that, despite current models struggling
with complex and chaotic dynamics, there is a clear trend of improvement in
physics understanding as model capacity and inference settings scale.
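
The likelihood-preference idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and the abstract does not give the exact formula for PPE. The sketch assumes that the likelihood surrogate is a Monte Carlo estimate of the standard noise-prediction loss, and that PPE is the fraction of valid-invalid pairs in which the model does not prefer the valid video (its loss on the valid video is not lower). The names `denoising_loss`, `plausibility_preference_error`, `make_toy_denoiser`, and the `alpha_bars` noise schedule are hypothetical, and the toy denoiser merely stands in for a real video diffusion model.

```python
import numpy as np


def denoising_loss(denoiser, video, alpha_bars, rng, n_noise=4):
    """Monte Carlo estimate of the noise-prediction loss for one video.

    A lower loss is read as a higher (ELBO-based) likelihood under the model.
    """
    losses = []
    for t, a_bar in enumerate(alpha_bars):
        for _ in range(n_noise):
            eps = rng.standard_normal(video.shape)
            # Forward diffusion: mix the clean video with Gaussian noise.
            x_t = np.sqrt(a_bar) * video + np.sqrt(1.0 - a_bar) * eps
            eps_hat = denoiser(x_t, t)
            losses.append(np.mean((eps_hat - eps) ** 2))
    return float(np.mean(losses))


def plausibility_preference_error(denoiser, pairs, alpha_bars, seed=0):
    """Assumed PPE: share of pairs where the valid video is NOT preferred."""
    errors = 0
    for i, (valid, invalid) in enumerate(pairs):
        # Re-seed per pair so both videos see identical noise draws.
        loss_valid = denoising_loss(
            denoiser, valid, alpha_bars, np.random.default_rng(seed + i))
        loss_invalid = denoising_loss(
            denoiser, invalid, alpha_bars, np.random.default_rng(seed + i))
        errors += int(loss_valid >= loss_invalid)
    return errors / len(pairs)


def make_toy_denoiser(reference, alpha_bars):
    """Hypothetical stand-in model that believes every clean video is `reference`."""
    def denoiser(x_t, t):
        a_bar = alpha_bars[t]
        return (x_t - np.sqrt(a_bar) * reference) / np.sqrt(1.0 - a_bar)
    return denoiser


# Toy data: a "valid" clip and a perturbed "invalid" clip (frames x H x W).
alpha_bars = np.linspace(0.9, 0.1, 5)
valid_clip = np.zeros((4, 8, 8))
invalid_clip = valid_clip + 0.5

toy_model = make_toy_denoiser(valid_clip, alpha_bars)
ppe_good = plausibility_preference_error(
    toy_model, [(valid_clip, invalid_clip)], alpha_bars)
ppe_bad = plausibility_preference_error(
    toy_model, [(invalid_clip, valid_clip)], alpha_bars)
print(ppe_good, ppe_bad)
```

Because both videos in a pair are scored with the same noise draws, the comparison reflects the content of the videos and not sampling variance. With a real model, `denoiser` would wrap the network's noise prediction, and the pairs would come from the curated valid-invalid dataset.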