RoboScape: Physics-informed Embodied World Model
June 29, 2025
Authors: Yu Shang, Xin Zhang, Yinzhou Tang, Lei Jin, Chen Gao, Wei Wu, Yong Li
cs.AI
Abstract
World models have become indispensable tools for embodied intelligence,
serving as powerful simulators capable of generating realistic robotic videos
while addressing critical data scarcity challenges. However, current embodied
world models exhibit limited physical awareness, particularly in modeling 3D
geometry and motion dynamics, resulting in unrealistic video generation for
contact-rich robotic scenarios. In this paper, we present RoboScape, a unified
physics-informed world model that jointly learns RGB video generation and
physics knowledge within an integrated framework. We introduce two key
physics-informed joint training tasks: temporal depth prediction that enhances
3D geometric consistency in video rendering, and keypoint dynamics learning
that implicitly encodes physical properties (e.g., object shape and material
characteristics) while improving complex motion modeling. Extensive experiments
demonstrate that RoboScape generates videos with superior visual fidelity and
physical plausibility across diverse robotic scenarios. We further validate its
practical utility through downstream applications including robotic policy
training with generated data and policy evaluation. Our work provides new
insights for building efficient physics-informed world models to advance
embodied intelligence research. The code is available at:
https://github.com/tsinghua-fib-lab/RoboScape.
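
To make the joint-training scheme described above concrete, the sketch below combines a video-generation loss with the two physics-informed auxiliary losses (temporal depth prediction and keypoint dynamics). It is a minimal illustration only: the module names (`VideoBackbone`-style components passed in by the caller, `depth_head`, `keypoint_head`), the specific loss forms, and the weights `lambda_depth` and `lambda_kpt` are assumptions, not the authors' actual implementation.

```python
# Minimal sketch of a physics-informed joint training objective.
# NOTE: all module names, loss forms, and weights are hypothetical
# illustrations of the abstract's description, not RoboScape's code.
import torch.nn as nn
import torch.nn.functional as F


class PhysicsInformedWorldModelSketch(nn.Module):
    def __init__(self, backbone: nn.Module, depth_head: nn.Module,
                 keypoint_head: nn.Module,
                 lambda_depth: float = 0.5, lambda_kpt: float = 0.5):
        super().__init__()
        self.backbone = backbone          # shared video-generation backbone
        self.depth_head = depth_head      # predicts per-frame depth maps
        self.keypoint_head = keypoint_head  # predicts keypoint trajectories
        self.lambda_depth = lambda_depth
        self.lambda_kpt = lambda_kpt

    def forward(self, frames, depth_gt, kpt_gt):
        # Shared features drive RGB generation and both auxiliary tasks,
        # so physical knowledge is learned jointly with video rendering.
        feats, rgb_pred = self.backbone(frames)
        depth_pred = self.depth_head(feats)
        kpt_pred = self.keypoint_head(feats)

        loss_rgb = F.mse_loss(rgb_pred, frames)        # video generation
        loss_depth = F.l1_loss(depth_pred, depth_gt)   # temporal depth prediction
        loss_kpt = F.mse_loss(kpt_pred, kpt_gt)        # keypoint dynamics learning

        # Auxiliary losses regularize the shared features toward
        # 3D-geometric and motion-dynamic consistency.
        return (loss_rgb
                + self.lambda_depth * loss_depth
                + self.lambda_kpt * loss_kpt)
```

The key design point this sketch captures is that the depth and keypoint heads share the backbone's features, so gradients from the auxiliary tasks shape the same representation used for RGB generation; the actual network architectures and loss weightings would come from the paper itself.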