Chain of World: World Model Thinking in Latent Motion
March 3, 2026
Authors: Fuxiang Yang, Donglin Di, Lulu Tang, Xuancheng Zhang, Lei Fan, Hao Li, Chen Wei, Tonghua Su, Baorui Ma
cs.AI
Abstract
Vision-Language-Action (VLA) models are a promising path toward embodied intelligence, yet they often overlook the predictive and temporal-causal structure underlying visual dynamics. World-model VLAs address this by predicting future frames, but waste capacity reconstructing redundant backgrounds. Latent-action VLAs encode frame-to-frame transitions compactly, but lack temporally continuous dynamic modeling and world knowledge. To overcome these limitations, we introduce CoWVLA (Chain-of-World VLA), a new "Chain of World" paradigm that unifies world-model temporal reasoning with a disentangled latent motion representation. First, a pretrained video VAE serves as a latent motion extractor, explicitly factorizing video segments into structure and motion latents. Then, during pre-training, the VLA learns from an instruction and an initial frame to infer a continuous latent motion chain and predict the segment's terminal frame. Finally, during co-fine-tuning, this latent dynamic is aligned with discrete action prediction by jointly modeling sparse keyframes and action sequences in a unified autoregressive decoder. This design preserves the world-model benefits of temporal reasoning and world knowledge while retaining the compactness and interpretability of latent actions, enabling efficient visuomotor learning. Extensive experiments on robotic simulation benchmarks show that CoWVLA outperforms existing world-model and latent-action approaches and achieves moderate computational efficiency, highlighting its potential as a more effective VLA pretraining paradigm. The project website can be found at https://fx-hit.github.io/cowvla-io.
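The structure/motion factorization at the core of this pipeline can be illustrated with a toy sketch. Everything below is hypothetical: the function name, the scalar "motion code", and the averaging are illustrative stand-ins, not the paper's actual pretrained video VAE. The sketch only shows the shape of the idea: one structure latent for a segment, plus one motion latent per frame-to-frame transition forming the chain.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_video_segment(frames):
    """Hypothetical latent-motion extractor (a stand-in for the paper's
    pretrained video VAE): factorizes a video segment into one structure
    latent (content of the initial frame) and a chain of motion latents
    (one compact code per frame-to-frame transition)."""
    structure = frames[0].mean(axis=(0, 1))       # toy per-channel "structure" code
    motion_chain = np.array([f2.mean() - f1.mean()  # toy scalar "motion" code per step
                             for f1, f2 in zip(frames[:-1], frames[1:])])
    return structure, motion_chain

# During pre-training, the VLA would take (instruction, initial frame),
# infer this motion chain, and predict the segment's terminal frame;
# here we only demonstrate the factorization's shapes.
T, H, W, C = 5, 8, 8, 3               # a short segment of T RGB frames
frames = rng.random((T, H, W, C))
structure, motion_chain = encode_video_segment(frames)

assert structure.shape == (C,)        # one structure latent for the segment
assert motion_chain.shape == (T - 1,) # one motion latent per transition
```

The point of the sketch is the asymmetry: the (large, redundant) background lives once in the structure latent, while the chain carries only compact transition codes, which is what lets the model avoid re-predicting full future frames.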