Kinema4D: Kinematic 4D World Modeling for Spatiotemporal Embodied Simulation
March 17, 2026
Authors: Mutian Xu, Tianbao Zhang, Tianqi Liu, Zhaoxi Chen, Xiaoguang Han, Ziwei Liu
cs.AI
Abstract
Simulating robot-world interactions is a cornerstone of Embodied AI. Recently, a few works have shown promise in leveraging video generation to transcend the rigid visual/physical constraints of traditional simulators. However, they primarily operate in 2D space or are guided by static environmental cues, ignoring the fundamental reality that robot-world interactions are inherently 4D spatiotemporal events requiring precise interactive modeling. To restore this 4D essence while ensuring precise robot control, we introduce Kinema4D, a new action-conditioned 4D generative robotic simulator that disentangles robot-world interaction into: i) precise 4D representation of robot controls: we drive a URDF-based 3D robot via kinematics, producing a precise 4D robot control trajectory; ii) generative 4D modeling of environmental reactions: we project the 4D robot trajectory into a pointmap as a spatiotemporal visual signal, which conditions the generative model to synthesize complex environments' reactive dynamics as synchronized RGB/pointmap sequences. To facilitate training, we curate a large-scale dataset, Robo4D-200k, comprising 201,426 robot interaction episodes with high-quality 4D annotations. Extensive experiments demonstrate that our method effectively simulates physically plausible, geometry-consistent, and embodiment-agnostic interactions that faithfully mirror diverse real-world dynamics. It is also the first to show potential for zero-shot transfer, providing a high-fidelity foundation for advancing next-generation embodied simulation.
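The two-stage pipeline described above — kinematics producing a robot trajectory, then projecting it into a pointmap signal — can be sketched minimally. The 2-link planar arm, the camera intrinsics, and the 64x64 pointmap resolution below are illustrative assumptions for one timestep of a trajectory; they are not the paper's actual URDF models, camera setup, or rendering pipeline.

```python
import numpy as np

def fk_planar_2link(theta1, theta2, l1=0.4, l2=0.3):
    """Forward kinematics for a toy 2-link planar arm (an illustrative
    stand-in for URDF-driven kinematics): returns 3D positions of the
    base, elbow, and end effector in the robot base frame (z = 0 plane)."""
    p1 = np.array([l1 * np.cos(theta1), l1 * np.sin(theta1), 0.0])
    p2 = p1 + np.array([l2 * np.cos(theta1 + theta2),
                        l2 * np.sin(theta1 + theta2), 0.0])
    return np.stack([np.zeros(3), p1, p2])

def project_to_pointmap(points_cam, K, hw=(64, 64)):
    """Rasterize 3D points (camera frame) into an H x W x 3 'pointmap':
    each covered pixel stores the 3D coordinates of the point that
    projects onto it; uncovered pixels stay zero."""
    H, W = hw
    pointmap = np.zeros((H, W, 3), dtype=np.float32)
    uv = (K @ points_cam.T).T          # pinhole projection
    uv = uv[:, :2] / uv[:, 2:3]        # perspective divide by depth
    for p, (u, v) in zip(points_cam, uv):
        ui, vi = int(round(u)), int(round(v))
        if 0 <= ui < W and 0 <= vi < H:
            pointmap[vi, ui] = p       # store XYZ at the pixel
    return pointmap

# One timestep of the 4D control trajectory: joint angles -> link
# positions -> camera frame -> pointmap (the spatiotemporal signal
# that would condition the generative model).
K = np.array([[60.0, 0.0, 32.0],      # assumed intrinsics for a 64x64 frame
              [0.0, 60.0, 32.0],
              [0.0, 0.0, 1.0]])
joints_world = fk_planar_2link(0.3, -0.5)
points_cam = joints_world + np.array([0.0, 0.0, 1.5])  # camera 1.5 m away
pmap = project_to_pointmap(points_cam, K)
print(np.count_nonzero(pmap.any(axis=-1)))  # pixels carrying 3D points
```

Repeating this per frame yields a pointmap sequence; in the paper's formulation such a sequence conditions the video generator, so that the synthesized RGB frames stay geometrically aligned with the commanded robot motion.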