Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
April 29, 2026
Authors: Jun Guo, Qiwei Li, Peiyan Li, Zilong Chen, Nan Sun, Yifei Su, Heyun Wang, Yuan Zhang, Xinghang Li, Huaping Liu
cs.AI
Abstract
We propose X-WAM, a Unified 4D World Model that integrates real-time robotic action execution and high-fidelity 4D world synthesis (video + 3D reconstruction) in a single framework, addressing a critical limitation of prior unified world models (e.g., UWM), which model only the 2D pixel space and struggle to balance action efficiency against world-modeling quality. To leverage the strong visual priors of pretrained video diffusion models, X-WAM imagines the future world by predicting multi-view RGB-D videos and obtains spatial information efficiently through a lightweight structural adaptation: the final few blocks of the pretrained Diffusion Transformer are replicated into a dedicated depth-prediction branch that reconstructs future spatial structure. Moreover, we propose Asynchronous Noise Sampling (ANS) to jointly optimize generation quality and action decoding efficiency. ANS applies a specialized asynchronous denoising schedule at inference, decoding actions in only a few steps for efficient real-time execution while dedicating the full step sequence to high-fidelity video generation. Rather than entirely decoupling the two timesteps during training, ANS samples them from a joint distribution so that training matches the inference-time distribution. Pretrained on over 5,800 hours of robotic data, X-WAM achieves average success rates of 79.2% and 90.7% on the RoboCasa and RoboTwin 2.0 benchmarks, respectively, while producing 4D reconstructions and generations that surpass existing methods on both visual and geometric metrics.
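To make the ANS idea concrete, the following is a minimal PyTorch sketch of what an asynchronous denoising schedule could look like, based only on the description in the abstract above. All names (model.denoise_action, model.denoise_video, sample_joint_timesteps, the step counts, and the specific coupling between timesteps) are hypothetical illustrations, not X-WAM's actual API or the paper's exact schedule.

```python
import torch

def asynchronous_denoise(model, video_latent, action_latent,
                         video_steps=50, action_steps=5):
    """Inference-time sketch of ANS: the action branch traverses far
    fewer denoising steps so actions can be decoded quickly for
    real-time control, while the video branch keeps the full schedule
    for high-fidelity generation. `model` is a hypothetical joint
    denoiser that accepts separate timesteps per modality."""
    # Separate timestep schedules for the two modalities.
    video_ts = torch.linspace(1.0, 0.0, video_steps + 1)
    action_ts = torch.linspace(1.0, 0.0, action_steps + 1)

    # Decode actions first with the coarse schedule.
    for i in range(action_steps):
        action_latent = model.denoise_action(
            action_latent, video_latent,
            t_action=action_ts[i], t_next=action_ts[i + 1])
    action = action_latent  # executable after only a few steps

    # Continue refining the video with the full schedule.
    for i in range(video_steps):
        video_latent = model.denoise_video(
            video_latent, action,
            t_video=video_ts[i], t_next=video_ts[i + 1])
    return action, video_latent

def sample_joint_timesteps(batch_size):
    """Training-time counterpart: rather than drawing video and action
    timesteps independently, sample them jointly so the (t_video,
    t_action) pairs seen in training match the asynchronous pairs
    visited at inference. The coupling below (t_action <= t_video,
    since actions denoise faster) is one illustrative choice."""
    t_video = torch.rand(batch_size)
    t_action = t_video * torch.rand(batch_size)
    return t_video, t_action
```

The design point this sketch illustrates is the abstract's alignment argument: if training always used a single shared timestep (or fully independent ones), the network would never see the asymmetric (t_video, t_action) combinations that the fast-action/slow-video schedule visits at inference, so sampling the pair from a joint distribution keeps the training and inference distributions consistent.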