Physics-IQ Verified

Abstract

Video generative models (VGMs) have become a new frontier that can be used not only for video generation but also for a multitude of downstream tasks, including world modeling. To advance these tasks, a good video model must understand the physical reality of the world. Evaluating this understanding is an emerging field and has led to the Physics-IQ benchmark, which quantifies it explicitly by comparing model-generated videos to real-world videos of physical experiments. In this work, we present a systematic audit of the Physics-IQ benchmark, expose its shortcomings, and propose three solutions that sharpen how we can measure the physical understanding of VGMs. Specifically, we improve prompt and ground-truth quality to reduce the influence of confounding factors, and we introduce a sample-level scoring system that weights each sample and each metric equally. Our resulting benchmark, Physics-IQ Verified, refines 57.6% of all samples and improves over 34.8% of prompts. In a comparison study using six image-to-video generative models, we observe moderate but meaningful ranking changes (Kendall's τ = 0.46). We hope Physics-IQ Verified advances the community by providing a more reliable signal toward physically accurate VGMs. The code for the benchmark can be accessed at https://github.com/google-deepmind/physics-iq-benchmark.
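To make the two quantitative ideas in the abstract concrete, the following is a minimal sketch and not the benchmark's actual implementation. It assumes each sample carries a set of metric scores already normalized to [0, 1] with higher being better (the abstract does not specify the metrics or their normalization), and it computes Kendall's τ in its simplest tie-free form (tau-a). The function names, metric names, and model names are hypothetical, and the numbers are toy values rather than results from the paper.

```python
from itertools import combinations
from statistics import mean


def sample_level_score(per_sample_metrics):
    """Average the metrics within each sample, then average across samples.

    per_sample_metrics: list of dicts mapping a metric name to a score.
    Assumption: scores are normalized to [0, 1], higher is better.
    Every sample counts equally, and within a sample every metric counts equally.
    """
    return mean(mean(metrics.values()) for metrics in per_sample_metrics)


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two rankings of the same items (no ties).

    rank_a, rank_b: dicts mapping an item name to its rank position.
    """
    items = list(rank_a)
    concordant = discordant = 0
    for x, y in combinations(items, 2):
        # Positive product: both rankings order the pair the same way.
        sign = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(items) * (len(items) - 1) // 2
    return (concordant - discordant) / n_pairs


# Toy inputs (hypothetical, not data from the benchmark).
toy_samples = [
    {"metric_1": 1.0, "metric_2": 0.5},
    {"metric_1": 0.0, "metric_2": 0.5},
]
old_ranking = {"model_a": 1, "model_b": 2, "model_c": 3, "model_d": 4}
new_ranking = {"model_a": 2, "model_b": 1, "model_c": 3, "model_d": 4}

toy_score = sample_level_score(toy_samples)
toy_tau = kendall_tau(old_ranking, new_ranking)
```

A τ of 1 means the two rankings agree on every pair of models and a τ of -1 means they disagree on every pair, so a value such as the reported 0.46 indicates that a substantial share of model pairs change their relative order between the two benchmark versions.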