Physics-IQ Verified

Abstract

Video generative models (VGMs) have become a new frontier that can be used not just for video generation but for a multitude of downstream tasks, including world modeling. To advance these tasks, a good video model must understand the physical reality of the world. Evaluating this understanding is an emerging field and has led to the Physics-IQ benchmark, which quantifies it explicitly by comparing model-generated videos to real-world videos of physical experiments. In this work, we present a systematic audit of the Physics-IQ benchmark, expose its shortcomings, and propose three solutions that sharpen how we measure the physical understanding of VGMs. Specifically, we improve prompt and ground-truth quality to reduce the influence of confounding factors, and we introduce a sample-level scoring system that weights each sample and metric equally. Our resulting benchmark, Physics-IQ Verified, refines 57.6% of all samples and improves over 34.8% of prompts. In a comparison study using six image-to-video generative models, we observe moderate but meaningful ranking changes (Kendall's τ = 0.46). We hope Physics-IQ Verified advances the community by providing a more reliable signal toward physically accurate VGMs. The code for the benchmark can be accessed at https://github.com/google-deepmind/physics-iq-benchmark
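
For readers unfamiliar with the rank-agreement statistic quoted above, the following is a minimal sketch of how Kendall's τ can be computed between two rankings of the same six models. The function name `kendall_tau` and both rank lists are hypothetical and purely illustrative; they are not the rankings reported for the benchmark, and the sketch assumes the tie-free tau-a variant, which may differ from the variant used in this work.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two rankings of the same items (assumes no ties)."""
    if len(rank_a) != len(rank_b):
        raise ValueError("rankings must have the same length")
    n = len(rank_a)
    if n < 2:
        raise ValueError("need at least two items to compare")

    concordant = 0
    discordant = 0
    # Compare every pair of items: a pair is concordant when both rankings
    # order the two items the same way, and discordant when they disagree.
    for i, j in combinations(range(n), 2):
        sign = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1

    total_pairs = n * (n - 1) / 2
    return (concordant - discordant) / total_pairs


# Hypothetical ranks of six models under two scoring schemes
# (illustrative values only, not results from the benchmark).
original_rank = [1, 2, 3, 4, 5, 6]
verified_rank = [2, 1, 3, 4, 6, 5]

tau = kendall_tau(original_rank, verified_rank)
print(f"Kendall's tau = {tau:.2f}")
```

A value of 1 means the two rankings agree on every pair of models, -1 means they are exact reverses, and values in between indicate partial agreement. In practice a library routine such as `scipy.stats.kendalltau`, which also handles ties, would likely be preferable to this hand-rolled version.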