How Far Is Video Generation from World Model: A Physical Law Perspective

Abstract

OpenAI's Sora highlights the potential of video generation for developing world models that adhere to fundamental physical laws. However, the ability of video generation models to discover such laws purely from visual data without human priors can be questioned. A world model learning the true law should give predictions robust to nuances and correctly extrapolate on unseen scenarios. In this work, we evaluate across three key scenarios: in-distribution, out-of-distribution, and combinatorial generalization. We developed a 2D simulation testbed for object movement and collisions to generate videos deterministically governed by one or more classical mechanics laws. This provides an unlimited supply of data for large-scale experimentation and enables quantitative evaluation of whether the generated videos adhere to physical laws. We trained diffusion-based video generation models to predict object movements based on initial frames. Our scaling experiments show perfect generalization within the distribution, measurable scaling behavior for combinatorial generalization, but failure in out-of-distribution scenarios. Further experiments reveal two key insights about the generalization mechanisms of these models: (1) the models fail to abstract general physical rules and instead exhibit "case-based" generalization behavior, i.e., mimicking the closest training example; (2) when generalizing to new cases, models are observed to prioritize different factors when referencing training data: color > size > velocity > shape. Our study suggests that scaling alone is insufficient for video generation models to uncover fundamental physical laws, despite its role in Sora's broader success. See our project page at https://phyworld.github.io
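
To make the testbed idea concrete, the following is a minimal sketch of a deterministic simulation governed by a single classical-mechanics law (a perfectly elastic collision), together with a simple velocity-based error that could be used to score a predicted trajectory against the ground truth. This is not the authors' simulator: it is reduced to one dimension for brevity, and the function names (`elastic_collision_1d`, `simulate`, `velocity_error`), the shared `radius`, the frame count, and the time step are all assumptions introduced for illustration.

```python
import numpy as np


def elastic_collision_1d(m1, v1, m2, v2):
    """Post-collision velocities for a perfectly elastic 1D collision.

    Derived from conservation of momentum and kinetic energy.
    """
    u1 = ((m1 - m2) * v1 + 2 * m2 * v2) / (m1 + m2)
    u2 = ((m2 - m1) * v2 + 2 * m1 * v1) / (m1 + m2)
    return u1, u2


def simulate(m1, x1, v1, m2, x2, v2, radius=0.5, n_frames=40, dt=0.1):
    """Roll out two balls on a line (ball 1 starts left of ball 2).

    Returns an array of shape (n_frames, 2) holding the centre positions.
    The rollout is fully determined by the initial state (hypothetical setup).
    """
    traj = np.zeros((n_frames, 2))
    for t in range(n_frames):
        traj[t] = (x1, x2)
        # Balls touch and are still approaching: apply the collision law once.
        if x2 - x1 <= 2 * radius and v1 > v2:
            v1, v2 = elastic_collision_1d(m1, v1, m2, v2)
        x1 += v1 * dt
        x2 += v2 * dt
    return traj


def velocity_error(pred, truth, dt=0.1):
    """Mean absolute difference between per-frame velocities of two rollouts."""
    v_pred = np.diff(pred, axis=0) / dt
    v_true = np.diff(truth, axis=0) / dt
    return float(np.mean(np.abs(v_pred - v_true)))


if __name__ == "__main__":
    truth = simulate(m1=1.0, x1=0.0, v1=2.0, m2=1.0, x2=5.0, v2=0.0)
    # A "prediction" that ignores the collision: both balls keep their speed.
    frames = np.arange(truth.shape[0])[:, None]
    naive = np.array([0.0, 5.0]) + frames * 0.1 * np.array([2.0, 0.0])
    print("error of collision-free prediction:", velocity_error(naive, truth))
```

Under these assumptions, an out-of-distribution test could be approximated by drawing training velocities from one range and evaluating on initial velocities outside it, while reusing the same error function.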