Physics-IQ Verified

Abstract

Video generative models (VGMs) have become a new frontier, useful not just for video generation but for a multitude of downstream tasks, including world modeling. To advance these tasks, a good video model must understand the physical reality of the world. Evaluating this understanding is an emerging field and has led to the Physics-IQ benchmark, which quantifies it explicitly by comparing model-generated videos to real-world videos of physical experiments. In this work, we present a systematic audit of the Physics-IQ benchmark, expose its shortcomings, and propose three solutions that sharpen how the physical understanding of VGMs can be measured. Specifically, we improve prompt and ground-truth quality to reduce the influence of confounding factors, and we introduce a sample-level scoring system that weights each sample and metric equally. Our resulting benchmark, Physics-IQ Verified, refines 57.6% of all samples and improves over 34.8% of prompts. In a comparison study using six image-to-video generative models, we observe moderate but meaningful ranking changes (Kendall's τ = 0.46). We hope Physics-IQ Verified advances the community by providing a more reliable signal toward physically accurate VGMs. The code for the benchmark can be accessed at https://github.com/google-deepmind/physics-iq-benchmark
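To make the ranking comparison concrete, the following is a minimal sketch of how Kendall's τ can be computed between two rankings of the same six models. The model names and both rankings are hypothetical placeholders chosen for illustration; they are not the rankings from the study, and the function `kendall_tau` is an assumed helper rather than part of the benchmark code. The sketch uses the tie-free form (tau-a), which may differ from the variant the authors used.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a for two rankings without ties.

    rank_a and rank_b map each item to its rank position (1 = best).
    """
    items = list(rank_a)
    if set(items) != set(rank_b):
        raise ValueError("both rankings must cover the same items")

    concordant = 0
    discordant = 0
    for x, y in combinations(items, 2):
        # A pair is concordant when both rankings order x and y the same way.
        sign = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1

    n_pairs = len(items) * (len(items) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical rankings of six placeholder models (NOT the study's data).
original = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6}
verified = {"A": 2, "B": 1, "C": 5, "D": 3, "E": 6, "F": 4}

tau = kendall_tau(original, verified)
print(round(tau, 3))  # 0.467, i.e. 11 concordant and 4 discordant pairs out of 15
```

With six models there are only 15 pairs, so τ moves in steps of 2/15. In this illustrative example, four pairwise order flips are enough to bring τ down to 7/15, which gives a sense of how much reordering a moderate value represents at this sample size.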