WoW: Towards a World-Omniscient World Model Through Embodied Interaction

Abstract

Humans develop an intuitive understanding of physics through active interaction with the world. This stands in stark contrast to current video models such as Sora, which rely on passive observation and therefore struggle to grasp physical causality. This observation leads to our central hypothesis: a world model's authentic physical intuition must be grounded in extensive, causally rich interactions with the real world. To test this hypothesis, we present WoW, a 14-billion-parameter generative world model trained on 2 million robot interaction trajectories. Our findings reveal that the model's understanding of physics is a probabilistic distribution over plausible outcomes, which leads to stochastic instabilities and physical hallucinations. Furthermore, we demonstrate that this emergent capability can be actively constrained toward physical realism by SOPHIA, in which vision-language model agents evaluate the DiT-generated output and guide its refinement by iteratively evolving the language instructions. In addition, a co-trained Inverse Dynamics Model translates these refined plans into executable robotic actions, closing the imagination-to-action loop. We establish WoWBench, a new benchmark focused on physical consistency and causal reasoning in video, on which WoW achieves state-of-the-art performance in both human and automated evaluation, demonstrating strong ability in physical causality, collision dynamics, and object permanence. Our work provides systematic evidence that large-scale, real-world interaction is a cornerstone for developing physical intuition in AI. Models, data, and benchmarks will be open-sourced.
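
The evaluate-and-refine loop described in the abstract (a generator produces a rollout, a vision-language critic scores it, the language instruction is rewritten, and an inverse dynamics model turns the accepted rollout into actions) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `refine_loop`, `imagine_then_act`, `Critique`, the `threshold` and `max_rounds` parameters, and the four callables are all hypothetical placeholders, and the stopping rule is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Critique:
    """Hypothetical critic output: a plausibility score and a rewrite hint."""
    score: float    # assumed range: 0.0 (implausible) to 1.0 (plausible)
    feedback: str   # text used to revise the instruction


def refine_loop(
    instruction: str,
    generate: Callable[[str], Any],              # stand-in for the video generator
    critique: Callable[[Any, str], Critique],    # stand-in for the VLM critic
    rewrite: Callable[[str, str], str],          # stand-in for instruction rewriting
    threshold: float = 0.8,                      # assumed acceptance threshold
    max_rounds: int = 3,                         # assumed iteration budget
) -> Tuple[Any, str, List[Tuple[str, float]]]:
    """Generate, score, and rewrite until the critic accepts or the budget ends.

    Returns the best-scoring rollout, the instruction that produced it,
    and the (instruction, score) history of every round.
    """
    history: List[Tuple[str, float]] = []
    best_video, best_instruction = None, instruction
    best_score = float("-inf")
    current = instruction
    for _ in range(max_rounds):
        video = generate(current)
        result = critique(video, current)
        history.append((current, result.score))
        if result.score > best_score:
            best_video, best_score, best_instruction = video, result.score, current
        if result.score >= threshold:
            break  # critic judges the rollout physically plausible enough
        current = rewrite(current, result.feedback)
    return best_video, best_instruction, history


def imagine_then_act(
    instruction: str,
    generate: Callable[[str], Any],
    critique: Callable[[Any, str], Critique],
    rewrite: Callable[[str, str], str],
    inverse_dynamics: Callable[[Any], Any],      # stand-in for the inverse dynamics model
    **loop_kwargs: Any,
) -> Any:
    """Refine the imagined rollout, then map it to actions."""
    video, _, _ = refine_loop(instruction, generate, critique, rewrite, **loop_kwargs)
    return inverse_dynamics(video)
```

Passing the models in as plain callables keeps the sketch independent of any particular model API; in practice each callable would wrap a real generator, critic, or inverse dynamics model.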