Causal World Modeling for Robot Control

January 29, 2026
Authors: Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, Yujun Shen, Yinghao Xu
cs.AI

Abstract

This work highlights that video world modeling, alongside vision-language pre-training, establishes a fresh and independent foundation for robot learning. Intuitively, video world models provide the ability to imagine the near future by understanding the causality between actions and visual dynamics. Inspired by this, we introduce LingBot-VA, an autoregressive diffusion framework that learns frame prediction and policy execution simultaneously. Our model features three carefully crafted designs: (1) a shared latent space integrating vision and action tokens, driven by a Mixture-of-Transformers (MoT) architecture; (2) a closed-loop rollout mechanism that continually grounds the rollout in ground-truth observations from the environment; (3) an asynchronous inference pipeline that parallelizes action prediction and motor execution to support efficient control. We evaluate our model on both simulation benchmarks and real-world scenarios, where it shows significant promise in long-horizon manipulation, data efficiency in post-training, and strong generalization to novel configurations. The code and model are made publicly available to facilitate community research.
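To make designs (2) and (3) concrete, here is a minimal sketch of a closed-loop rollout with asynchronous action prediction. All names (`WorldActionModel`, `Robot`, `predict_chunk`, `execute`, `observe`) are hypothetical placeholders, not the released LingBot-VA API; the sketch only illustrates the control-flow idea: a predictor thread imagines the next action chunk from the latest real observation while the robot is still executing the previous chunk.

```python
"""Minimal sketch of a closed-loop rollout with asynchronous inference.
All classes and methods here are illustrative stand-ins, not the
actual LingBot-VA interface."""

import queue
import threading


class WorldActionModel:
    """Stand-in for a video world model that jointly predicts future
    frames and an action chunk from the latest observation."""

    def predict_chunk(self, observation):
        # The real model would run autoregressive diffusion over shared
        # vision/action tokens; here we return a dummy action chunk.
        return [f"action_{i}_given_{observation}" for i in range(4)]


class Robot:
    """Stand-in for the motor interface and camera."""

    def execute(self, action):
        pass  # send one action to the low-level controller

    def observe(self):
        return "frame"  # latest ground-truth camera observation


def closed_loop_rollout(model, robot, steps=10):
    """Closed loop: each prediction starts from a fresh ground-truth
    observation rather than an imagined frame, so prediction errors
    do not compound over a long horizon."""
    chunks = queue.Queue(maxsize=2)  # small buffer of predicted chunks

    def predictor():
        # Runs in parallel with motor execution: while the robot is
        # still executing chunk t, chunk t+1 is already being predicted.
        for _ in range(steps):
            chunks.put(model.predict_chunk(robot.observe()))

    threading.Thread(target=predictor, daemon=True).start()
    for _ in range(steps):
        for action in chunks.get():  # blocks until a chunk is ready
            robot.execute(action)


if __name__ == "__main__":
    closed_loop_rollout(WorldActionModel(), Robot())
```

Re-reading the camera before each prediction is what makes the rollout closed-loop: imagined frames guide chunk generation internally, but they are never substituted for real observations between chunks.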