WoW: Towards a World-Omniscient World Model Through Embodied Interaction
September 26, 2025
作者: Xiaowei Chi, Peidong Jia, Chun-Kai Fan, Xiaozhu Ju, Weishi Mi, Kevin Zhang, Zhiyuan Qin, Wanxin Tian, Kuangzhi Ge, Hao Li, Zezhong Qian, Anthony Chen, Qiang Zhou, Yueru Jia, Jiaming Liu, Yong Dai, Qingpo Wuwu, Chengyu Bai, Yu-Kai Wang, Ying Li, Lizhang Chen, Yong Bao, Zhiyuan Jiang, Jiacheng Zhu, Kai Tang, Ruichuan An, Yulin Luo, Qiuxuan Feng, Siyuan Zhou, Chi-min Chan, Chengkai Hou, Wei Xue, Sirui Han, Yike Guo, Shanghang Zhang, Jian Tang
cs.AI
Abstract
Humans develop an understanding of intuitive physics through active
interaction with the world. This approach is in stark contrast to current video
models, such as Sora, which rely on passive observation and therefore struggle
with grasping physical causality. This observation leads to our central
hypothesis: authentic physical intuition in a world model must be grounded in
extensive, causally rich interactions with the real world. To test this
hypothesis, we present WoW, a 14-billion-parameter generative world model
trained on 2 million robot interaction trajectories. Our findings reveal that
the model's understanding of physics is a probabilistic distribution of
plausible outcomes, leading to stochastic instabilities and physical
hallucinations. Furthermore, we demonstrate that this emergent capability can
be actively constrained toward physical realism by SOPHIA, where
vision-language model agents evaluate the DiT-generated output and guide its
refinement by iteratively evolving the language instructions. In addition, a
co-trained Inverse Dynamics Model translates these refined plans into
executable robotic actions, thus closing the imagination-to-action loop. We
establish WoWBench, a new benchmark focused on physical consistency and causal
reasoning in video, where WoW achieves state-of-the-art performance in both
human and automated evaluation, demonstrating strong capabilities in physical
causality, collision dynamics, and object permanence. Our work provides
systematic evidence that large-scale, real-world interaction is a cornerstone
for developing physical intuition in AI. Models, data, and benchmarks will be
open-sourced.
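The refinement loop sketched in the abstract — a VLM critic scoring DiT rollouts, evolving the language instruction, and an Inverse Dynamics Model decoding the accepted rollout into actions — can be illustrated in pseudocode. This is a minimal sketch assuming hypothetical `world_model`, `critic`, and `idm` interfaces; none of these names or signatures come from the paper's actual implementation.

```python
# Hypothetical sketch of the SOPHIA-style imagination-to-action loop.
# All objects and method names are illustrative assumptions, not the
# paper's real API.

def imagine_to_action(instruction, world_model, critic, idm,
                      max_rounds=3, threshold=0.9):
    """Iteratively refine a generated rollout until the critic judges it
    physically plausible, then decode it into executable robot actions."""
    video = world_model.generate(instruction)          # DiT video rollout
    for _ in range(max_rounds):
        score, feedback = critic.evaluate(video)       # VLM plausibility check
        if score >= threshold:
            break
        # Evolve the language instruction using the critic's feedback,
        # then re-imagine the rollout under the revised instruction.
        instruction = critic.revise(instruction, feedback)
        video = world_model.generate(instruction)
    return idm.infer_actions(video)                    # Inverse Dynamics Model
```

The loop terminates either when the critic's score clears the threshold or after a fixed budget of refinement rounds, so a single implausible generation cannot stall the pipeline.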