TesserAct: Learning 4D Embodied World Models
April 29, 2025
Authors: Haoyu Zhen, Qiao Sun, Hongxin Zhang, Junyan Li, Siyuan Zhou, Yilun Du, Chuang Gan
cs.AI
Abstract
This paper presents an effective approach for learning novel 4D embodied world models, which predict the dynamic evolution of 3D scenes over time in response to an embodied agent's actions, providing both spatial and temporal consistency. We propose to learn a 4D world model by training on RGB-DN (RGB, Depth, and Normal) videos. This not only surpasses traditional 2D models by incorporating detailed shape, configuration, and temporal changes into its predictions, but also allows us to effectively learn accurate inverse dynamics models for an embodied agent. Specifically, we first extend existing robotic manipulation video datasets with depth and normal information using off-the-shelf models. Next, we fine-tune a video generation model on this annotated dataset so that it jointly predicts RGB, depth, and normals for each frame. We then present an algorithm that directly converts the generated RGB, depth, and normal videos into a high-quality 4D scene of the world. Our method ensures temporal and spatial coherence in 4D scene predictions from embodied scenarios, enables novel view synthesis for embodied environments, and facilitates policy learning that significantly outperforms policies derived from prior video-based world models.
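To make the conversion step more concrete, below is a minimal sketch (not the authors' algorithm) of the basic geometric operation it relies on: back-projecting a predicted RGB-D frame into a colored 3D point cloud under an assumed pinhole camera model, then stacking such clouds over time to obtain a rough 4D (3D + time) representation. The intrinsics fx, fy, cx, cy and the random input frames are hypothetical placeholders.

```python
# Minimal sketch, assuming a pinhole camera with hypothetical intrinsics.
# Not the paper's conversion algorithm; it only illustrates how a predicted
# depth map plus RGB can be lifted to a per-frame point cloud.
import numpy as np

def backproject_rgbd(rgb: np.ndarray, depth: np.ndarray,
                     fx: float, fy: float, cx: float, cy: float):
    """Return (N, 3) 3D points and (N, 3) colors for one RGB-D frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx                           # pinhole back-projection
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    colors = rgb.reshape(-1, 3)
    valid = points[:, 2] > 0                        # drop pixels with no depth
    return points[valid], colors[valid]

# Toy "video": two random RGB-D frames stacked over time into a 4D sequence.
frames = [(np.random.rand(64, 64, 3), np.random.rand(64, 64) + 0.5)
          for _ in range(2)]
cloud_sequence = [backproject_rgbd(rgb, d, fx=60.0, fy=60.0, cx=32.0, cy=32.0)
                  for rgb, d in frames]
print(len(cloud_sequence), cloud_sequence[0][0].shape)
```

In the paper's setting, the predicted surface normals additionally constrain local geometry, which this sketch omits for brevity.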