Physics-IQ Verified

Abstract

Video generative models (VGMs) have become a new frontier that can be used not just for video generation but for a multitude of downstream tasks, including world modeling. To advance these tasks, a good video model must understand the physical reality of the world. Evaluating this understanding is an emerging field and has led to the Physics-IQ benchmark, which quantifies it explicitly by comparing model-generated videos to real-world videos of physical experiments. In this work, we present a systematic audit of the Physics-IQ benchmark, expose its shortcomings, and propose three solutions that sharpen how physical understanding of VGMs is measured. Specifically, we improve prompt quality and ground-truth quality to reduce the influence of confounding factors, and we introduce a sample-level scoring system that weights each sample and metric equally. Our resulting benchmark, Physics-IQ Verified, refines 57.6% of all samples and improves over 34.8% of prompts. In a comparison study using six image-to-video generative models, we observe moderate but meaningful ranking changes (Kendall's τ = 0.46). We hope Physics-IQ Verified advances the community by providing a more reliable signal toward physically accurate VGMs. The code for the benchmark can be accessed at https://github.com/google-deepmind/physics-iq-benchmark.
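For readers unfamiliar with the rank-agreement statistic quoted above, the following is a minimal sketch of Kendall's τ (the tau-a variant, assuming no ties) for two rankings of six models. The function name `kendall_tau` and both rank lists are hypothetical and chosen only for illustration; they are not taken from the paper or from the benchmark repository.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two rankings of the same items (no ties assumed)."""
    if len(rank_a) != len(rank_b):
        raise ValueError("both rankings must cover the same number of items")
    concordant = 0
    discordant = 0
    # Compare every unordered pair of items once.
    for i, j in combinations(range(len(rank_a)), 2):
        product = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if product > 0:
            concordant += 1  # both rankings order the pair the same way
        elif product < 0:
            discordant += 1  # the rankings disagree on this pair
    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical ranks of six models under two scoring schemes (illustrative only).
ranks_before = [1, 2, 3, 4, 5, 6]
ranks_after = [3, 1, 4, 2, 6, 5]

tau = kendall_tau(ranks_before, ranks_after)
print(f"Kendall's tau = {tau:.3f}")
```

With six models there are 15 pairs, so τ moves in steps of 2/15 when there are no ties. In this made-up example 4 of the 15 pairs are ordered differently, giving τ = 7/15. A value of 1 would mean identical rankings and -1 a fully reversed ranking.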