TesserAct: Learning 4D Embodied World Models

Abstract

This paper presents an effective approach for learning novel 4D embodied world models, which predict the dynamic evolution of 3D scenes over time in response to an embodied agent's actions while maintaining both spatial and temporal consistency. We propose to learn a 4D world model by training on RGB-DN (RGB, Depth, and Normal) videos. This approach not only surpasses traditional 2D models by incorporating detailed shape, configuration, and temporal changes into its predictions, but also allows us to effectively learn accurate inverse dynamic models for an embodied agent. Specifically, we first extend existing robotic manipulation video datasets with depth and normal information using off-the-shelf models. Next, we fine-tune a video generation model on this annotated dataset so that it jointly predicts RGB-DN for each frame. We then present an algorithm that directly converts the generated RGB, Depth, and Normal videos into a high-quality 4D scene of the world. Our method ensures temporal and spatial coherence in 4D scene predictions from embodied scenarios, enables novel view synthesis for embodied environments, and facilitates learning policies that significantly outperform those derived from prior video-based world models.
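To make the idea of turning per-frame depth into a time-indexed 3D scene concrete, here is a minimal sketch of pinhole back-projection applied frame by frame. It is not the paper's conversion algorithm: it ignores the normal maps and any consistency optimization, assumes a static camera with known intrinsics, and uses hypothetical names (`backproject_frame`, `depth_video_to_4d`, `fx`, `fy`, `cx`, `cy`) chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]
Point4D = Tuple[int, float, float, float]  # (frame index, x, y, z)


def backproject_frame(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point3D]:
    """Lift one depth map to camera-space 3D points with a pinhole model.

    Assumption: depth[v][u] is the metric depth along the optical axis at
    pixel row v, column u; non-positive values mark invalid pixels.
    """
    points: List[Point3D] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:  # skip invalid depth
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


def depth_video_to_4d(
    depth_video: Sequence[Sequence[Sequence[float]]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point4D]:
    """Tag each back-projected point with its frame index (assumed static camera)."""
    scene: List[Point4D] = []
    for t, depth in enumerate(depth_video):
        for x, y, z in backproject_frame(depth, fx, fy, cx, cy):
            scene.append((t, x, y, z))
    return scene
```

A practical version would vectorize these loops and attach the RGB color and predicted normal of each pixel to its point. The sketch only shows why per-frame depth is enough to recover geometry once camera intrinsics are assumed.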
