Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

Abstract

We propose X-WAM, a Unified 4D World Model that combines real-time robotic action execution and high-fidelity 4D world synthesis (video + 3D reconstruction) in a single framework. It addresses two critical limitations of prior unified world models (e.g., UWM): they model only 2D pixel space, and they fail to balance action efficiency against world modeling quality. To leverage the strong visual priors of pretrained video diffusion models, X-WAM imagines the future world by predicting multi-view RGB-D videos, and it obtains spatial information efficiently through a lightweight structural adaptation: the final few blocks of the pretrained Diffusion Transformer are replicated into a dedicated depth prediction branch that reconstructs future spatial information. We further propose Asynchronous Noise Sampling (ANS) to jointly optimize generation quality and action decoding efficiency. During inference, ANS applies a specialized asynchronous denoising schedule that decodes actions rapidly in fewer steps, enabling efficient real-time execution, while dedicating the full sequence of steps to generating high-fidelity video. During training, rather than entirely decoupling the timesteps, ANS samples them from their joint distribution to align with the inference distribution. Pretrained on over 5,800 hours of robotic data, X-WAM achieves average success rates of 79.2% and 90.7% on the RoboCasa and RoboTwin 2.0 benchmarks, respectively, while producing high-fidelity 4D reconstruction and generation that surpasses existing methods on both visual and geometric metrics.
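To make the asynchronous schedule concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a linear noise level running from 1.0 (pure noise) to 0.0 (clean), and it assumes the action stream simply reaches 0.0 early and stays there while the video stream continues. The function names `async_schedule` and `sample_joint_timesteps`, the step counts, and the coupling rule between the two timesteps are all hypothetical; the abstract does not specify them.

```python
import random


def async_schedule(video_steps: int, action_steps: int):
    """Return (t_video, t_action) noise levels for each inference iteration.

    Assumption (not from the paper): both streams start at 1.0 and decay
    linearly, with the action stream reaching 0.0 after `action_steps`
    iterations and the video stream after `video_steps` iterations.
    """
    if not 0 < action_steps <= video_steps:
        raise ValueError("require 0 < action_steps <= video_steps")
    schedule = []
    for i in range(video_steps + 1):
        t_video = 1.0 - i / video_steps
        # The action is fully denoised early, then held clean.
        t_action = max(0.0, 1.0 - i / action_steps)
        schedule.append((t_video, t_action))
    return schedule


def sample_joint_timesteps(video_steps: int, action_steps: int, rng: random.Random):
    """Draw one coupled (t_video, t_action) training pair.

    Hypothetical coupling: sample a shared progress value and map it to
    both noise levels, so training pairs lie on the same curve that the
    inference schedule follows instead of being drawn independently.
    """
    progress = rng.random()  # uniform in [0, 1)
    speedup = video_steps / action_steps
    t_video = 1.0 - progress
    t_action = max(0.0, 1.0 - progress * speedup)
    return t_video, t_action


if __name__ == "__main__":
    for t_video, t_action in async_schedule(video_steps=10, action_steps=2):
        print(f"video t={t_video:.1f}  action t={t_action:.1f}")
```

Under these assumptions, the action is ready for execution after the first `action_steps` iterations, while the remaining iterations refine only the video. Sampling training pairs along the same curve is one simple way to keep the training distribution close to what the model sees at inference.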

