

Kinema4D: Kinematic 4D World Modeling for Spatiotemporal Embodied Simulation

March 17, 2026
作者: Mutian Xu, Tianbao Zhang, Tianqi Liu, Zhaoxi Chen, Xiaoguang Han, Ziwei Liu
cs.AI

Abstract

Simulating robot-world interactions is a cornerstone of Embodied AI. Recently, a few works have shown promise in leveraging video generation to transcend the rigid visual/physical constraints of traditional simulators. However, they primarily operate in 2D space or are guided by static environmental cues, ignoring the fundamental reality that robot-world interactions are inherently 4D spatiotemporal events requiring precise interactive modeling. To restore this 4D essence while ensuring precise robot control, we introduce Kinema4D, a new action-conditioned 4D generative robotic simulator that disentangles robot-world interaction into: i) Precise 4D representation of robot controls: we drive a URDF-based 3D robot via kinematics, producing a precise 4D robot control trajectory; ii) Generative 4D modeling of environmental reactions: we project the 4D robot trajectory into a pointmap as a spatiotemporal visual signal, controlling the generative model to synthesize complex environments' reactive dynamics into synchronized RGB/pointmap sequences. To facilitate training, we curate a large-scale dataset, Robo4D-200k, comprising 201,426 robot interaction episodes with high-quality 4D annotations. Extensive experiments demonstrate that our method effectively simulates physically plausible, geometry-consistent, and embodiment-agnostic interactions that faithfully mirror diverse real-world dynamics. For the first time, it shows potential for zero-shot transfer, providing a high-fidelity foundation for advancing next-generation embodied simulation.
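The pipeline's conditioning signal — posing a robot via kinematics and projecting its geometry into a per-pixel pointmap — can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a hypothetical planar 2-link arm in place of a full URDF model, and a naive z-buffered rasterization with a pinhole camera; the function names (`fk_two_link`, `project_to_pointmap`) and all parameters are illustrative.

```python
import numpy as np

def fk_two_link(theta1, theta2, l1=0.4, l2=0.3):
    """Forward kinematics for a hypothetical planar 2-link arm.

    Stands in for driving a URDF-based robot via kinematics.
    Returns base, elbow, and end-effector positions in 3D (z = 0 plane).
    """
    p1 = np.array([l1 * np.cos(theta1), l1 * np.sin(theta1), 0.0])
    p2 = p1 + np.array([l2 * np.cos(theta1 + theta2),
                        l2 * np.sin(theta1 + theta2), 0.0])
    return np.stack([np.zeros(3), p1, p2])

def project_to_pointmap(points_world, K, T_wc, hw=(64, 64)):
    """Rasterize 3D points into a pointmap of shape (H, W, 3).

    Each covered pixel stores the camera-frame XYZ of the nearest
    point (z-buffering); uncovered pixels stay zero. Repeating this
    per frame yields the spatiotemporal pointmap sequence used as a
    visual conditioning signal.
    """
    H, W = hw
    pointmap = np.zeros((H, W, 3), dtype=np.float32)
    depth = np.full((H, W), np.inf)
    R, t = T_wc[:3, :3], T_wc[:3, 3]       # world -> camera transform
    pts_cam = (R @ points_world.T).T + t
    for p in pts_cam:
        if p[2] <= 0:                      # behind the camera
            continue
        uv = K @ (p / p[2])                # pinhole projection
        u, v = int(round(uv[0])), int(round(uv[1]))
        if 0 <= u < W and 0 <= v < H and p[2] < depth[v, u]:
            depth[v, u] = p[2]
            pointmap[v, u] = p             # keep the closest point
    return pointmap
```

In practice one would sample a dense point cloud from the robot's mesh at each timestep rather than three joint positions, and stack the per-frame pointmaps along time to obtain the 4D control trajectory.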