Physics-IQ Verified

June 17, 2026
Authors: Tim Rädsch, Yuki M Asano, Hilde Kuehne, Stefan Bauer, Priyank Jaini, Robert Geirhos, Carsten T. Lüth
cs.AI

Abstract

Video generative models (VGMs) have become a new frontier that can be used not just for video generation but also for a multitude of downstream tasks, including world modeling. To advance these tasks, a good video model must understand the physical reality of the world. Evaluating this understanding is an emerging field and has led to the Physics-IQ benchmark, which quantifies it explicitly by comparing model-generated videos to real-world videos of physical experiments. In this work, we present a systematic audit of the Physics-IQ benchmark, expose its shortcomings, and propose three solutions that sharpen how we can measure the physical understanding of VGMs. Specifically, we improve prompt and ground-truth quality to reduce the influence of confounding factors, and we introduce a sample-level scoring system that weights each sample and metric equally. Our resulting benchmark, Physics-IQ Verified, refines 57.6% of all samples and improves over 34.8% of prompts. In a comparison study using six image-to-video generative models, we observe moderate but meaningful ranking changes (Kendall's τ = 0.46). We hope Physics-IQ Verified advances the community by providing a more reliable signal toward physically accurate VGMs. The code for the benchmark can be accessed at https://github.com/google-deepmind/physics-iq-benchmark
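
The abstract names two quantitative ideas without giving formulas: a sample-level score that weights each sample and metric equally, and Kendall's τ as the measure of how much the model ranking changes. The following is a minimal sketch of both, not the authors' implementation. The function names, the metric keys, the model labels, and every number are hypothetical and chosen only for illustration. The sketch also assumes that metrics are already normalized to a common scale and that the rankings contain no ties.

```python
from itertools import combinations


def sample_level_score(samples):
    """Mean over samples of the per-sample mean over metrics.

    `samples` is a list of dicts mapping metric name -> normalized score.
    Averaging within each sample first gives every sample equal weight,
    and the plain mean inside a sample gives every metric equal weight.
    """
    per_sample = [sum(m.values()) / len(m) for m in samples]
    return sum(per_sample) / len(per_sample)


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two rankings (dicts: model -> rank, no ties)."""
    models = list(rank_a)
    concordant = discordant = 0
    for m, n in combinations(models, 2):
        # A pair is concordant if both rankings order it the same way.
        sign = (rank_a[m] - rank_a[n]) * (rank_b[m] - rank_b[n])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical per-sample metric scores for one model.
scores = [
    {"metric_1": 0.2, "metric_2": 0.4},
    {"metric_1": 0.6, "metric_2": 0.8},
]
overall = sample_level_score(scores)

# Hypothetical rankings of six models under two benchmark versions.
old_rank = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6}
new_rank = {"A": 2, "B": 1, "C": 4, "D": 3, "E": 6, "F": 5}
tau = kendall_tau(old_rank, new_rank)
```

With six models there are 15 model pairs, so each pair whose order flips between the two rankings lowers τ by 2/15. A value of 1 means the rankings are identical, and a value of -1 means one is the reverse of the other.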