WoW: Towards a World omniscient World model Through Embodied Interaction
September 26, 2025
Authors: Xiaowei Chi, Peidong Jia, Chun-Kai Fan, Xiaozhu Ju, Weishi Mi, Kevin Zhang, Zhiyuan Qin, Wanxin Tian, Kuangzhi Ge, Hao Li, Zezhong Qian, Anthony Chen, Qiang Zhou, Yueru Jia, Jiaming Liu, Yong Dai, Qingpo Wuwu, Chengyu Bai, Yu-Kai Wang, Ying Li, Lizhang Chen, Yong Bao, Zhiyuan Jiang, Jiacheng Zhu, Kai Tang, Ruichuan An, Yulin Luo, Qiuxuan Feng, Siyuan Zhou, Chi-min Chan, Chengkai Hou, Wei Xue, Sirui Han, Yike Guo, Shanghang Zhang, Jian Tang
cs.AI
Abstract
Humans develop an understanding of intuitive physics through active
interaction with the world. This stands in stark contrast to current video
models, such as Sora, which rely on passive observation and therefore struggle
to grasp physical causality. This observation leads to our central hypothesis:
a world model's authentic physical intuition must be grounded in extensive,
causally rich interactions with the real world. To test this hypothesis, we
present WoW, a 14-billion-parameter generative world model
hypothesis, we present WoW, a 14-billion-parameter generative world model
trained on 2 million robot interaction trajectories. Our findings reveal that
the model's understanding of physics manifests as a probability distribution
over plausible outcomes, leading to stochastic instabilities and physical
hallucinations. Furthermore, we demonstrate that this emergent capability can
be actively constrained toward physical realism by SOPHIA, where
vision-language model agents evaluate the DiT-generated output and guide its
refinement by iteratively evolving the language instructions. In addition, a
co-trained Inverse Dynamics Model translates these refined plans into
executable robotic actions, thus closing the imagination-to-action loop. We
establish WoWBench, a new benchmark focused on physical consistency and causal
reasoning in video, where WoW achieves state-of-the-art performance in both
human and automated evaluation, demonstrating strong capabilities in physical
causality, collision dynamics, and object permanence. Our work provides
systematic evidence that large-scale, real-world interaction is a cornerstone
for developing physical intuition in AI. Models, data, and benchmarks will be
open-sourced.
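The imagine-critique-refine loop the abstract describes can be sketched as follows. This is a minimal illustrative stub, not the paper's actual implementation: the function names (`generate_video`, `vlm_critic`, `inverse_dynamics`, `sophia_plan`), interfaces, and scoring behavior are all assumptions standing in for the 14B DiT world model, the SOPHIA VLM critic, and the co-trained Inverse Dynamics Model.

```python
from dataclasses import dataclass

@dataclass
class Critique:
    score: float             # physical-plausibility score in [0, 1]
    refined_instruction: str # instruction rewritten by the VLM critic

def generate_video(instruction: str) -> list[str]:
    """Stand-in for the DiT world model: imagines frames from language."""
    return [f"frame_{i}:{instruction}" for i in range(3)]

def vlm_critic(video: list[str], instruction: str, step: int) -> Critique:
    """Stand-in for the VLM agent: in the paper this judges physical
    causality (collisions, object permanence) and evolves the instruction.
    Here the score simply improves with each refinement step."""
    score = min(1.0, 0.5 + 0.2 * step)
    return Critique(score, instruction + " (more physically grounded)")

def inverse_dynamics(video: list[str]) -> list[str]:
    """Stand-in for the IDM: maps consecutive frames to robot actions."""
    return [f"action({a}->{b})" for a, b in zip(video, video[1:])]

def sophia_plan(instruction: str, threshold: float = 0.9,
                max_iters: int = 5) -> list[str]:
    """Imagine, critique, refine the instruction; once the imagined video
    is judged plausible enough, translate it into executable actions."""
    for step in range(max_iters):
        video = generate_video(instruction)
        critique = vlm_critic(video, instruction, step)
        if critique.score >= threshold:
            break
        instruction = critique.refined_instruction
    return inverse_dynamics(video)

actions = sophia_plan("pick up the red block")
print(len(actions))  # → 2 (two frame transitions from three frames)
```

The key design point mirrored here is that refinement happens in language space (the critic rewrites the instruction, not the pixels), and only the final accepted rollout is handed to the inverse dynamics model to close the imagination-to-action loop.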