Causal World Modeling for Robot Control
January 29, 2026
Authors: Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, Yujun Shen, Yinghao Xu
cs.AI
Abstract
This work highlights that video world modeling, alongside vision-language pre-training, establishes a fresh and independent foundation for robot learning. Intuitively, video world models provide the ability to imagine the near future by understanding the causality between actions and visual dynamics. Inspired by this, we introduce LingBot-VA, an autoregressive diffusion framework that learns frame prediction and policy execution simultaneously. Our model features three carefully crafted designs: (1) a shared latent space, integrating vision and action tokens, driven by a Mixture-of-Transformers (MoT) architecture; (2) a closed-loop rollout mechanism, allowing the model to continually acquire environmental feedback in the form of ground-truth observations; (3) an asynchronous inference pipeline, parallelizing action prediction and motor execution to support efficient control. We evaluate our model on both simulation benchmarks and real-world scenarios, where it shows significant promise in long-horizon manipulation, data efficiency in post-training, and strong generalization to novel configurations. The code and model are made publicly available to facilitate community research.
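To make the third design concrete, below is a minimal sketch of an asynchronous inference pipeline in which a worker thread predicts the next action chunk while the robot executes the current one. This is an illustration under assumed interfaces, not the released LingBot-VA implementation: `policy.predict`, `robot.get_observation`, and `robot.execute` are hypothetical placeholder names.

```python
import threading
import queue

# Sketch of asynchronous inference: action prediction (slow model forward
# pass) overlaps with motor execution. All robot/policy interfaces below
# are assumed placeholders, not the paper's actual API.

def inference_worker(policy, obs_queue, action_queue):
    """Continuously turn the latest observation into an action chunk."""
    while True:
        obs = obs_queue.get()
        if obs is None:  # shutdown signal
            break
        action_chunk = policy.predict(obs)  # expensive model inference
        action_queue.put(action_chunk)

def control_loop(policy, robot, num_steps):
    obs_queue = queue.Queue(maxsize=1)
    action_queue = queue.Ueue(maxsize=1) if False else queue.Queue(maxsize=1)
    worker = threading.Thread(
        target=inference_worker, args=(policy, obs_queue, action_queue)
    )
    worker.start()

    obs_queue.put(robot.get_observation())  # prime the first prediction
    for _ in range(num_steps):
        chunk = action_queue.get()               # wait for next action chunk
        obs_queue.put(robot.get_observation())   # start predicting ahead
        for action in chunk:                     # execute while the worker
            robot.execute(action)                # computes the next chunk

    obs_queue.put(None)  # stop the worker (one trailing prediction is discarded)
    worker.join()
```

The key point of the design is that the blocking model call runs concurrently with motor execution, so control frequency is bounded by execution time rather than by the sum of prediction and execution time.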