TesserAct: Learning 4D Embodied World Models
April 29, 2025
Authors: Haoyu Zhen, Qiao Sun, Hongxin Zhang, Junyan Li, Siyuan Zhou, Yilun Du, Chuang Gan
cs.AI
Abstract
This paper presents an effective approach for learning novel 4D embodied
world models, which predict the dynamic evolution of 3D scenes over time in
response to an embodied agent's actions, providing both spatial and temporal
consistency. We propose to learn a 4D world model by training on RGB-DN (RGB,
Depth, and Normal) videos. This not only surpasses traditional 2D models by
incorporating detailed shape, configuration, and temporal changes into their
predictions, but also allows us to effectively learn accurate inverse dynamics
models for an embodied agent. Specifically, we first extend existing robotic
manipulation video datasets with depth and normal information leveraging
off-the-shelf models. Next, we fine-tune a video generation model on this
annotated dataset, which jointly predicts RGB-DN (RGB, Depth, and Normal) for
each frame. We then present an algorithm to directly convert generated RGB,
Depth, and Normal videos into a high-quality 4D scene of the world. Our method
ensures temporal and spatial coherence in 4D scene predictions from embodied
scenarios, enables novel view synthesis for embodied environments, and
facilitates policy learning that significantly outperforms those derived from
prior video-based world models.
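
To make the final conversion step more concrete, below is a minimal sketch (not the authors' implementation) of how generated RGB and depth frames can be back-projected into per-frame point clouds and stacked over time into a simple time-indexed ("4D") representation. It assumes known pinhole camera intrinsics (fx, fy, cx, cy) and uses only NumPy; the function names are illustrative, and the paper's actual algorithm additionally exploits the predicted surface normals to enforce spatial and temporal coherence.

```python
# Minimal sketch: lift predicted RGB-D frames to a time-indexed point cloud.
# Assumptions: depth in metres, pinhole intrinsics (fx, fy, cx, cy) known.
import numpy as np

def backproject_depth(depth, rgb, fx, fy, cx, cy):
    """Convert one HxW depth map and HxWx3 RGB frame into an Nx6 (xyz + rgb) point cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth
    x = (u - cx) * z / fx                            # pinhole back-projection
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    colors = rgb.reshape(-1, 3)
    valid = points[:, 2] > 0                         # drop pixels with invalid depth
    return np.concatenate([points[valid], colors[valid]], axis=1)

def video_to_4d(depth_video, rgb_video, fx, fy, cx, cy):
    """Stack per-frame point clouds into a list indexed by time (a simple 4D scene)."""
    return [backproject_depth(d, c, fx, fy, cx, cy)
            for d, c in zip(depth_video, rgb_video)]
```

In practice, one would also register the per-frame clouds in a common world frame (using known or estimated camera poses) before using them for novel view synthesis or policy learning.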