TesserAct: Learning 4D Embodied World Models
April 29, 2025
Authors: Haoyu Zhen, Qiao Sun, Hongxin Zhang, Junyan Li, Siyuan Zhou, Yilun Du, Chuang Gan
cs.AI
Abstract
This paper presents an effective approach for learning novel 4D embodied
world models, which predict the dynamic evolution of 3D scenes over time in
response to an embodied agent's actions, providing both spatial and temporal
consistency. We propose to learn a 4D world model by training on RGB-DN (RGB,
Depth, and Normal) videos. This not only surpasses traditional 2D models by
incorporating detailed shape, configuration, and temporal changes into their
predictions, but also allows us to effectively learn accurate inverse dynamics
models for an embodied agent. Specifically, we first extend existing robotic
manipulation video datasets with depth and normal information leveraging
off-the-shelf models. Next, we fine-tune a video generation model on this
annotated dataset, which jointly predicts RGB-DN (RGB, Depth, and Normal) for
each frame. We then present an algorithm to directly convert generated RGB,
Depth, and Normal videos into a high-quality 4D scene of the world. Our method
ensures temporal and spatial coherence in 4D scene predictions from embodied
scenarios, enables novel view synthesis for embodied environments, and
facilitates policy learning that significantly outperforms those derived from
prior video-based world models.
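
To make the final conversion step more concrete, below is a minimal sketch (not the authors' implementation) of how generated RGB and depth frames can be back-projected into per-frame point clouds and stacked over time into a simple time-indexed ("4D") representation. It assumes known pinhole camera intrinsics (fx, fy, cx, cy) and uses only NumPy; the function names are illustrative, and the paper's actual algorithm additionally exploits the predicted surface normals to enforce spatial and temporal coherence.

```python
# Minimal sketch: lift predicted RGB-D frames to a time-indexed point cloud.
# Assumptions: depth in metres, pinhole intrinsics (fx, fy, cx, cy) known.
import numpy as np

def backproject_depth(depth, rgb, fx, fy, cx, cy):
    """Convert one HxW depth map and HxWx3 RGB frame into an Nx6 (xyz + rgb) point cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth
    x = (u - cx) * z / fx                            # pinhole back-projection
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    colors = rgb.reshape(-1, 3)
    valid = points[:, 2] > 0                         # drop pixels with invalid depth
    return np.concatenate([points[valid], colors[valid]], axis=1)

def video_to_4d(depth_video, rgb_video, fx, fy, cx, cy):
    """Stack per-frame point clouds into a list indexed by time (a simple 4D scene)."""
    return [backproject_depth(d, c, fx, fy, cx, cy)
            for d, c in zip(depth_video, rgb_video)]
```

In practice, one would also register the per-frame clouds in a common world frame (using known or estimated camera poses) before using them for novel view synthesis or policy learning.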