TesserAct: Learning 4D Embodied World Models
April 29, 2025
Authors: Haoyu Zhen, Qiao Sun, Hongxin Zhang, Junyan Li, Siyuan Zhou, Yilun Du, Chuang Gan
cs.AI
Abstract
This paper presents an effective approach for learning novel 4D embodied world models, which predict the dynamic evolution of 3D scenes over time in response to an embodied agent's actions, providing both spatial and temporal consistency. We propose to learn a 4D world model by training on RGB-DN (RGB, Depth, and Normal) videos. This not only surpasses traditional 2D models by incorporating detailed shape, configuration, and temporal changes into its predictions, but also allows us to effectively learn accurate inverse dynamics models for an embodied agent. Specifically, we first extend existing robotic manipulation video datasets with depth and normal information using off-the-shelf models. Next, we fine-tune a video generation model on this annotated dataset so that it jointly predicts RGB, depth, and normals for each frame. We then present an algorithm that directly converts the generated RGB, depth, and normal videos into a high-quality 4D scene of the world. Our method ensures temporal and spatial coherence in 4D scene predictions from embodied scenarios, enables novel view synthesis for embodied environments, and facilitates policy learning that significantly outperforms policies derived from prior video-based world models.
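To make the conversion step more concrete, below is a minimal sketch (not the authors' algorithm) of the basic geometric operation it relies on: back-projecting a predicted RGB-D frame into a colored 3D point cloud under an assumed pinhole camera model, then stacking such clouds over time to obtain a rough 4D (3D + time) representation. The intrinsics fx, fy, cx, cy and the random input frames are hypothetical placeholders.

```python
# Minimal sketch, assuming a pinhole camera with hypothetical intrinsics.
# Not the paper's conversion algorithm; it only illustrates how a predicted
# depth map plus RGB can be lifted to a per-frame point cloud.
import numpy as np

def backproject_rgbd(rgb: np.ndarray, depth: np.ndarray,
                     fx: float, fy: float, cx: float, cy: float):
    """Return (N, 3) 3D points and (N, 3) colors for one RGB-D frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx                           # pinhole back-projection
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    colors = rgb.reshape(-1, 3)
    valid = points[:, 2] > 0                        # drop pixels with no depth
    return points[valid], colors[valid]

# Toy "video": two random RGB-D frames stacked over time into a 4D sequence.
frames = [(np.random.rand(64, 64, 3), np.random.rand(64, 64) + 0.5)
          for _ in range(2)]
cloud_sequence = [backproject_rgbd(rgb, d, fx=60.0, fy=60.0, cx=32.0, cy=32.0)
                  for rgb, d in frames]
print(len(cloud_sequence), cloud_sequence[0][0].shape)
```

In the paper's setting, the predicted surface normals additionally constrain local geometry, which this sketch omits for brevity.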