RoboScape: Physics-informed Embodied World Model
June 29, 2025
Authors: Yu Shang, Xin Zhang, Yinzhou Tang, Lei Jin, Chen Gao, Wei Wu, Yong Li
cs.AI
Abstract
World models have become indispensable tools for embodied intelligence,
serving as powerful simulators capable of generating realistic robotic videos
while addressing critical data scarcity challenges. However, current embodied
world models exhibit limited physical awareness, particularly in modeling 3D
geometry and motion dynamics, resulting in unrealistic video generation for
contact-rich robotic scenarios. In this paper, we present RoboScape, a unified
physics-informed world model that jointly learns RGB video generation and
physics knowledge within an integrated framework. We introduce two key
physics-informed joint training tasks: temporal depth prediction that enhances
3D geometric consistency in video rendering, and keypoint dynamics learning
that implicitly encodes physical properties (e.g., object shape and material
characteristics) while improving complex motion modeling. Extensive experiments
demonstrate that RoboScape generates videos with superior visual fidelity and
physical plausibility across diverse robotic scenarios. We further validate its
practical utility through downstream applications including robotic policy
training with generated data and policy evaluation. Our work provides new
insights for building efficient physics-informed world models to advance
embodied intelligence research. The code is available at:
https://github.com/tsinghua-fib-lab/RoboScape.
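
To make the joint-training scheme described above concrete, the sketch below combines a video-generation loss with the two physics-informed auxiliary losses (temporal depth prediction and keypoint dynamics). It is a minimal illustration only: the module names (`VideoBackbone`-style components passed in by the caller, `depth_head`, `keypoint_head`), the specific loss forms, and the weights `lambda_depth` and `lambda_kpt` are assumptions, not the authors' actual implementation.

```python
# Minimal sketch of a physics-informed joint training objective.
# NOTE: all module names, loss forms, and weights are hypothetical
# illustrations of the abstract's description, not RoboScape's code.
import torch.nn as nn
import torch.nn.functional as F


class PhysicsInformedWorldModelSketch(nn.Module):
    def __init__(self, backbone: nn.Module, depth_head: nn.Module,
                 keypoint_head: nn.Module,
                 lambda_depth: float = 0.5, lambda_kpt: float = 0.5):
        super().__init__()
        self.backbone = backbone          # shared video-generation backbone
        self.depth_head = depth_head      # predicts per-frame depth maps
        self.keypoint_head = keypoint_head  # predicts keypoint trajectories
        self.lambda_depth = lambda_depth
        self.lambda_kpt = lambda_kpt

    def forward(self, frames, depth_gt, kpt_gt):
        # Shared features drive RGB generation and both auxiliary tasks,
        # so physical knowledge is learned jointly with video rendering.
        feats, rgb_pred = self.backbone(frames)
        depth_pred = self.depth_head(feats)
        kpt_pred = self.keypoint_head(feats)

        loss_rgb = F.mse_loss(rgb_pred, frames)        # video generation
        loss_depth = F.l1_loss(depth_pred, depth_gt)   # temporal depth prediction
        loss_kpt = F.mse_loss(kpt_pred, kpt_gt)        # keypoint dynamics learning

        # Auxiliary losses regularize the shared features toward
        # 3D-geometric and motion-dynamic consistency.
        return (loss_rgb
                + self.lambda_depth * loss_depth
                + self.lambda_kpt * loss_kpt)
```

The key design point this sketch captures is that the depth and keypoint heads share the backbone's features, so gradients from the auxiliary tasks shape the same representation used for RGB generation; the actual network architectures and loss weightings would come from the paper itself.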