TesserAct: Learning 4D Embodied World Models

Abstract

This paper presents an effective approach for learning 4D embodied world models, which predict how a 3D scene evolves over time in response to an embodied agent's actions while remaining spatially and temporally consistent. We propose to learn a 4D world model by training on RGB-DN (RGB, depth, and normal) videos. Compared with traditional 2D models, this incorporates detailed shape, configuration, and temporal changes into the model's predictions, and it also allows us to learn accurate inverse dynamics models for an embodied agent. Specifically, we first extend existing robotic manipulation video datasets with depth and normal information using off-the-shelf models. Next, we fine-tune a video generation model on this annotated dataset so that it jointly predicts RGB-DN for each frame. We then present an algorithm that directly converts the generated RGB, depth, and normal videos into a high-quality 4D scene of the world. Our method ensures temporal and spatial coherence in 4D scene predictions from embodied scenarios, enables novel view synthesis for embodied environments, and supports policy learning that significantly outperforms policies derived from prior video-based world models.
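As a rough illustration of why per-frame depth is enough to lift a video into a time-indexed 3D representation, the sketch below back-projects each depth frame into camera-frame points with a standard pinhole model. It is not the conversion algorithm described in the abstract, whose details are not given here and which also uses the predicted normals. The function names `backproject_depth` and `video_to_4d` and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions made for this example.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]
DepthFrame = List[List[float]]  # depth[v][u], in the same unit as the output points


def backproject_depth(depth: DepthFrame, fx: float, fy: float,
                      cx: float, cy: float) -> List[Point]:
    """Lift one depth map to camera-frame 3D points (pinhole model).

    fx, fy, cx, cy are hypothetical camera intrinsics in pixels.
    Pixels with non-positive depth are treated as invalid and skipped.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


def video_to_4d(depth_video: List[DepthFrame], fx: float, fy: float,
                cx: float, cy: float) -> List[List[Point]]:
    """Return one point cloud per frame, i.e. 3D points indexed by time."""
    return [backproject_depth(frame, fx, fy, cx, cy) for frame in depth_video]


if __name__ == "__main__":
    # Two tiny 2x2 depth frames; the zero entry marks an invalid pixel.
    video = [
        [[1.0, 1.0], [2.0, 0.0]],
        [[1.0, 1.5], [2.0, 2.0]],
    ]
    clouds = video_to_4d(video, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
    for t, cloud in enumerate(clouds):
        print(t, len(cloud))
```

Back-projecting each frame independently gives no guarantee that the point clouds agree across frames, which is one reason a dedicated conversion step that enforces temporal and spatial coherence is needed.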
