Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
April 29, 2026
Authors: Jun Guo, Qiwei Li, Peiyan Li, Zilong Chen, Nan Sun, Yifei Su, Heyun Wang, Yuan Zhang, Xinghang Li, Huaping Liu
cs.AI
Abstract
We propose X-WAM, a Unified 4D World Model that unifies real-time robotic action execution and high-fidelity 4D world synthesis (video + 3D reconstruction) in a single framework, addressing the critical limitation of prior unified world models (e.g., UWM), which model only the 2D pixel space and fail to balance action efficiency against world-modeling quality. To leverage the strong visual priors of pretrained video diffusion models, X-WAM imagines the future world by predicting multi-view RGB-D videos and obtains spatial information efficiently through a lightweight structural adaptation: the final few blocks of the pretrained Diffusion Transformer are replicated into a dedicated depth-prediction branch that reconstructs future spatial information. Moreover, we propose Asynchronous Noise Sampling (ANS) to jointly optimize generation quality and action-decoding efficiency. ANS applies a specialized asynchronous denoising schedule during inference, rapidly decoding actions in fewer steps for efficient real-time execution while dedicating the full sequence of steps to generating high-fidelity video. Rather than entirely decoupling the two timesteps during training, ANS samples them from a joint distribution to align with the inference distribution. Pretrained on over 5,800 hours of robotic data, X-WAM achieves average success rates of 79.2% and 90.7% on the RoboCasa and RoboTwin 2.0 benchmarks, respectively, while producing high-fidelity 4D reconstructions and generations that surpass existing methods on both visual and geometric metrics.
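To make the asynchronous-schedule idea concrete, here is a minimal sketch of what ANS-style scheduling could look like. All names, step counts, and the particular joint distribution are illustrative assumptions, not the paper's actual implementation: the action branch gets a short denoising schedule for fast decoding, the video branch keeps the full schedule, and training draws the two timesteps jointly (action timestep at or below the video timestep) rather than independently, so the training distribution matches the asynchronous inference schedule.

```python
import numpy as np

def make_schedules(n_video_steps=50, n_action_steps=5):
    """Build denoising timestep schedules (noise level t goes 1.0 -> 0.0).

    The video branch runs the full schedule for high-fidelity generation;
    the action branch uses a much coarser schedule so actions are fully
    denoised early, enabling real-time execution. Step counts here are
    illustrative assumptions.
    """
    video_ts = np.linspace(1.0, 0.0, n_video_steps + 1)
    action_ts = np.linspace(1.0, 0.0, n_action_steps + 1)
    return video_ts, action_ts

def sample_joint_timesteps(rng):
    """Training-time sampling of (t_video, t_action) from a joint
    distribution, instead of two independent draws.

    The action timestep is drawn at or below the video timestep,
    mimicking inference, where the action branch is always "ahead"
    (less noisy) of the video branch. The uniform-conditional form
    here is an assumed, simplified choice for illustration.
    """
    t_video = rng.uniform(0.0, 1.0)
    t_action = rng.uniform(0.0, t_video)
    return t_video, t_action
```

At inference, one would step both branches together until the short action schedule is exhausted, execute the decoded actions immediately, and let the video branch continue through its remaining steps in the background.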