WoW: Towards a World omniscient World model Through Embodied Interaction
September 26, 2025
Authors: Xiaowei Chi, Peidong Jia, Chun-Kai Fan, Xiaozhu Ju, Weishi Mi, Kevin Zhang, Zhiyuan Qin, Wanxin Tian, Kuangzhi Ge, Hao Li, Zezhong Qian, Anthony Chen, Qiang Zhou, Yueru Jia, Jiaming Liu, Yong Dai, Qingpo Wuwu, Chengyu Bai, Yu-Kai Wang, Ying Li, Lizhang Chen, Yong Bao, Zhiyuan Jiang, Jiacheng Zhu, Kai Tang, Ruichuan An, Yulin Luo, Qiuxuan Feng, Siyuan Zhou, Chi-min Chan, Chengkai Hou, Wei Xue, Sirui Han, Yike Guo, Shanghang Zhang, Jian Tang
cs.AI
Abstract
Humans develop an understanding of intuitive physics through active
interaction with the world. This stands in stark contrast to current video
models, such as Sora, which rely on passive observation and therefore struggle
to grasp physical causality. This observation leads to our central hypothesis:
a world model's authentic physical intuition must be grounded in extensive,
causally rich interactions with the real world. To test this hypothesis, we
present WoW, a 14-billion-parameter generative world model
hypothesis, we present WoW, a 14-billion-parameter generative world model
trained on 2 million robot interaction trajectories. Our findings reveal that
the model's understanding of physics manifests as a probability distribution
over plausible outcomes, leading to stochastic instabilities and physical
hallucinations. Furthermore, we demonstrate that this emergent capability can
be actively constrained toward physical realism by SOPHIA, where
vision-language model agents evaluate the DiT-generated output and guide its
refinement by iteratively evolving the language instructions. In addition, a
co-trained Inverse Dynamics Model translates these refined plans into
executable robotic actions, thus closing the imagination-to-action loop. We
establish WoWBench, a new benchmark focused on physical consistency and causal
reasoning in video, where WoW achieves state-of-the-art performance in both
human and automated evaluation, demonstrating strong capabilities in physical
causality, collision dynamics, and object permanence. Our work provides
systematic evidence that large-scale, real-world interaction is a cornerstone
for developing physical intuition in AI. Models, data, and benchmarks will be
open-sourced.
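The imagine-critique-refine loop the abstract describes can be sketched as follows. This is a minimal illustrative stub, not the paper's actual implementation: the function names (`generate_video`, `vlm_critic`, `inverse_dynamics`, `sophia_plan`), interfaces, and scoring behavior are all assumptions standing in for the 14B DiT world model, the SOPHIA VLM critic, and the co-trained Inverse Dynamics Model.

```python
from dataclasses import dataclass

@dataclass
class Critique:
    score: float             # physical-plausibility score in [0, 1]
    refined_instruction: str # instruction rewritten by the VLM critic

def generate_video(instruction: str) -> list[str]:
    """Stand-in for the DiT world model: imagines frames from language."""
    return [f"frame_{i}:{instruction}" for i in range(3)]

def vlm_critic(video: list[str], instruction: str, step: int) -> Critique:
    """Stand-in for the VLM agent: in the paper this judges physical
    causality (collisions, object permanence) and evolves the instruction.
    Here the score simply improves with each refinement step."""
    score = min(1.0, 0.5 + 0.2 * step)
    return Critique(score, instruction + " (more physically grounded)")

def inverse_dynamics(video: list[str]) -> list[str]:
    """Stand-in for the IDM: maps consecutive frames to robot actions."""
    return [f"action({a}->{b})" for a, b in zip(video, video[1:])]

def sophia_plan(instruction: str, threshold: float = 0.9,
                max_iters: int = 5) -> list[str]:
    """Imagine, critique, refine the instruction; once the imagined video
    is judged plausible enough, translate it into executable actions."""
    for step in range(max_iters):
        video = generate_video(instruction)
        critique = vlm_critic(video, instruction, step)
        if critique.score >= threshold:
            break
        instruction = critique.refined_instruction
    return inverse_dynamics(video)

actions = sophia_plan("pick up the red block")
print(len(actions))  # → 2 (two frame transitions from three frames)
```

The key design point mirrored here is that refinement happens in language space (the critic rewrites the instruction, not the pixels), and only the final accepted rollout is handed to the inverse dynamics model to close the imagination-to-action loop.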