WoW: Towards a World-Omniscient World Model Through Embodied Interaction
September 26, 2025
作者: Xiaowei Chi, Peidong Jia, Chun-Kai Fan, Xiaozhu Ju, Weishi Mi, Kevin Zhang, Zhiyuan Qin, Wanxin Tian, Kuangzhi Ge, Hao Li, Zezhong Qian, Anthony Chen, Qiang Zhou, Yueru Jia, Jiaming Liu, Yong Dai, Qingpo Wuwu, Chengyu Bai, Yu-Kai Wang, Ying Li, Lizhang Chen, Yong Bao, Zhiyuan Jiang, Jiacheng Zhu, Kai Tang, Ruichuan An, Yulin Luo, Qiuxuan Feng, Siyuan Zhou, Chi-min Chan, Chengkai Hou, Wei Xue, Sirui Han, Yike Guo, Shanghang Zhang, Jian Tang
cs.AI
Abstract
Humans develop an understanding of intuitive physics through active
interaction with the world. This approach is in stark contrast to current video
models, such as Sora, which rely on passive observation and therefore struggle
with grasping physical causality. This observation leads to our central
hypothesis: authentic physical intuition in a world model must be grounded in
extensive, causally rich interactions with the real world. To test this
hypothesis, we present WoW, a 14-billion-parameter generative world model
trained on 2 million robot interaction trajectories. Our findings reveal that
the model's understanding of physics is a probabilistic distribution of
plausible outcomes, leading to stochastic instabilities and physical
hallucinations. Furthermore, we demonstrate that this emergent capability can
be actively constrained toward physical realism by SOPHIA, where
vision-language model agents evaluate the DiT-generated output and guide its
refinement by iteratively evolving the language instructions. In addition, a
co-trained Inverse Dynamics Model translates these refined plans into
executable robotic actions, thus closing the imagination-to-action loop. We
establish WoWBench, a new benchmark focused on physical consistency and causal
reasoning in video, where WoW achieves state-of-the-art performance in both
human and automated evaluation, demonstrating strong capabilities in physical
causality, collision dynamics, and object permanence. Our work provides
systematic evidence that large-scale, real-world interaction is a cornerstone
for developing physical intuition in AI. Models, data, and benchmarks will be
open-sourced.
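The refinement loop sketched in the abstract — a VLM critic scoring DiT rollouts, evolving the language instruction, and an Inverse Dynamics Model decoding the accepted rollout into actions — can be illustrated in pseudocode. This is a minimal sketch assuming hypothetical `world_model`, `critic`, and `idm` interfaces; none of these names or signatures come from the paper's actual implementation.

```python
# Hypothetical sketch of the SOPHIA-style imagination-to-action loop.
# All objects and method names are illustrative assumptions, not the
# paper's real API.

def imagine_to_action(instruction, world_model, critic, idm,
                      max_rounds=3, threshold=0.9):
    """Iteratively refine a generated rollout until the critic judges it
    physically plausible, then decode it into executable robot actions."""
    video = world_model.generate(instruction)          # DiT video rollout
    for _ in range(max_rounds):
        score, feedback = critic.evaluate(video)       # VLM plausibility check
        if score >= threshold:
            break
        # Evolve the language instruction using the critic's feedback,
        # then re-imagine the rollout under the revised instruction.
        instruction = critic.revise(instruction, feedback)
        video = world_model.generate(instruction)
    return idm.infer_actions(video)                    # Inverse Dynamics Model
```

The loop terminates either when the critic's score clears the threshold or after a fixed budget of refinement rounds, so a single implausible generation cannot stall the pipeline.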