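Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

The title describes research that uses priors learned from video data to model world dynamics and robot actions in a unified way in 4D (3D space plus time). Its distinguishing feature is asynchronous denoising: actions are decoded in fewer denoising steps than the video, so action execution stays fast while video generation keeps its full step budget.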

Abstract

We propose X-WAM, a Unified 4D World Model that combines real-time robotic action execution and high-fidelity 4D world synthesis (video + 3D reconstruction) in a single framework. It addresses critical limitations of prior unified world models (e.g., UWM), which model only 2D pixel space and fail to balance action efficiency against world modeling quality. To leverage the strong visual priors of pretrained video diffusion models, X-WAM imagines the future world by predicting multi-view RGB-D videos, and obtains spatial information efficiently through a lightweight structural adaptation: the final few blocks of the pretrained Diffusion Transformer are replicated into a dedicated depth prediction branch that reconstructs future spatial information. Moreover, we propose Asynchronous Noise Sampling (ANS) to jointly optimize generation quality and action decoding efficiency. During inference, ANS applies a specialized asynchronous denoising schedule that decodes actions rapidly in fewer steps, enabling efficient real-time execution, while dedicating the full sequence of steps to generating high-fidelity video. During training, rather than fully decoupling the action and video timesteps, ANS samples them from their joint distribution so that training matches the inference distribution. Pretrained on over 5,800 hours of robotic data, X-WAM achieves average success rates of 79.2% and 90.7% on the RoboCasa and RoboTwin 2.0 benchmarks, respectively, while producing high-fidelity 4D reconstruction and generation that surpasses existing methods on both visual and geometric metrics.
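
The abstract does not spell out how ANS pairs the action and video timesteps, so the following is only a minimal sketch of one way an asynchronous schedule and a matching joint sampler could look. Everything in it is an assumption made for illustration: the function names async_schedule and sample_joint_timesteps, the linear noise levels running from 1.0 (pure noise) to 0.0 (clean), and the rule that the action finishes denoising early and then stays clean. It is not the authors' implementation.

```python
import random


def async_schedule(video_steps: int, action_steps: int):
    """Hypothetical inference schedule: list of (t_video, t_action) pairs.

    Assumption: both modalities start at noise level 1.0. The action reaches
    0.0 after `action_steps` steps and stays there, while the video uses all
    `video_steps` steps.
    """
    if not 0 < action_steps <= video_steps:
        raise ValueError("need 0 < action_steps <= video_steps")
    pairs = []
    for i in range(video_steps + 1):
        t_video = 1.0 - i / video_steps
        t_action = max(0.0, 1.0 - i / action_steps)
        pairs.append((t_video, t_action))
    return pairs


def sample_joint_timesteps(video_steps: int, action_steps: int, rng: random.Random):
    """Hypothetical training sampler: one (t_video, t_action) pair.

    Assumption: instead of drawing the two timesteps independently, draw the
    video noise level uniformly and derive the action noise level from the
    same progress along the inference path, so training pairs resemble the
    pairs visited by async_schedule.
    """
    t_video = rng.random()
    progress = 1.0 - t_video  # fraction of the video schedule already done
    t_action = max(0.0, 1.0 - progress * video_steps / action_steps)
    return t_video, t_action


if __name__ == "__main__":
    # With 10 video steps and 2 action steps, the action is clean after step 2.
    for step, (t_v, t_a) in enumerate(async_schedule(10, 2)):
        print(step, round(t_v, 2), round(t_a, 2))
    print(sample_joint_timesteps(10, 2, random.Random(0)))
```

In this sketch the action can be read out as soon as its noise level hits 0.0, which is what would allow execution to start before the video finishes denoising. Sampling training pairs along the same path is one simple way to express "sampling from the joint distribution"; the actual distribution used by X-WAM may differ.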

Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
