Physics-IQ Verified

June 17, 2026
Authors: Tim Rädsch, Yuki M Asano, Hilde Kuehne, Stefan Bauer, Priyank Jaini, Robert Geirhos, Carsten T. Lüth
cs.AI

Abstract

Video generative models (VGMs) have become a new frontier that can be used not just for video generation but for a multitude of downstream tasks, including world modeling. To advance these tasks, a good video model must understand the physical reality of the world. Evaluating this understanding is an emerging field and has led to the Physics-IQ benchmark, which quantifies it explicitly by comparing model-generated videos to real-world videos of physical experiments. In this work, we present a systematic audit of the Physics-IQ benchmark, expose its shortcomings, and propose three solutions that sharpen how we can measure the physical understanding of VGMs. Specifically, we improve prompt and ground-truth quality to reduce the influence of confounding factors, and we introduce a sample-level scoring system that weights each sample and metric equally. Our resulting benchmark, Physics-IQ Verified, refines 57.6% of all samples and improves over 34.8% of prompts. In a comparison study using six image-to-video generative models, we observe moderate but meaningful ranking changes (Kendall's τ = 0.46). We hope Physics-IQ Verified advances the community by providing a more reliable signal toward physically accurate VGMs. The code for the benchmark can be accessed at https://github.com/google-deepmind/physics-iq-benchmark
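
The two quantitative ideas in the abstract, sample-level scoring with equal weights and rank agreement measured by Kendall's τ, can be illustrated with a minimal sketch. This is not the benchmark's implementation. The function names, the assumption that all metrics are already on a comparable scale, and the example rankings are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def sample_level_score(per_sample_metrics):
    """Average metrics within each sample, then average the samples.

    Assumption: every metric is already normalized to a comparable scale,
    so each sample and each metric contributes with equal weight.
    """
    sample_means = [sum(m) / len(m) for m in per_sample_metrics]
    return sum(sample_means) / len(sample_means)


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two rankings of the same items (no ties)."""
    items = list(rank_a)
    concordant = discordant = 0
    for x, y in combinations(items, 2):
        # A pair is concordant when both rankings order it the same way.
        sign = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(items) * (len(items) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical rankings of six models under two versions of a benchmark.
old_rank = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6}
new_rank = {"A": 2, "B": 1, "C": 3, "D": 5, "E": 4, "F": 6}

tau = kendall_tau(old_rank, new_rank)
score = sample_level_score([[0.2, 0.4], [0.6, 0.8]])
```

With six models there are 15 model pairs, so τ can only take values in steps of 2/15. A value of 1 means the two rankings agree on every pair, and a value of -1 means they disagree on every pair.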