LaWAM: A Latent World Action Model for Efficient Dynamics-Aware Robot Policies

Abstract

Vision-Language-Action models (VLAs) leverage large-scale vision-language pretraining for semantic robot control, but often lack explicit foresight into how robot actions change the scene. World-Action Models (WAMs) address this limitation by conditioning policies on predicted future states, yet existing approaches typically rely on computationally expensive video generation that carries substantial pixel-level redundancy. We present LaWAM, a Latent World Action Model that exposes predictive dynamics to robot policies through compact latent visual subgoals instead of reconstructed future video. At the core of LaWAM is a latent-action-conditioned Latent World Model (LaWM). We obtain LaWM by training a latent action model in the latent space of a pretrained vision foundation model and repurposing its forward decoder to predict future observation features, thereby simulating scene evolution. LaWAM then conditions action generation on these predicted latent visual subgoals to enable dynamics-aware robot control. LaWAM achieves state-of-the-art or competitive success rates (SRs) on LIBERO (98.6% SR), RoboTwin (91.22% SR), and real-world manipulation tasks while retaining low-latency inference: it takes 187 ms per action-chunk prediction, up to 24x lower wall-clock latency than pixel-space WAMs.
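The data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every name, dimension, and the random linear maps standing in for learned networks are hypothetical. In particular, the abstract does not say how a latent action is obtained at inference time, so the `latent_action_prior` component below is purely an assumption, and language conditioning is omitted for brevity.

```python
import random
from typing import Callable, List

Vec = List[float]


def make_linear(in_dim: int, out_dim: int, rng: random.Random) -> Callable[[Vec], Vec]:
    """Return a fixed random linear map, a toy stand-in for a learned network."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

    def apply(x: Vec) -> Vec:
        assert len(x) == in_dim, "input has the wrong dimension"
        return [sum(w * v for w, v in zip(row, x)) for row in weights]

    return apply


# Illustrative sizes only; none of these values come from the paper.
OBS_DIM, FEAT_DIM, LATENT_ACTION_DIM, ACTION_DIM, CHUNK_LEN = 16, 8, 4, 7, 5
rng = random.Random(0)

# Hypothetical components (all assumptions):
encode = make_linear(OBS_DIM, FEAT_DIM, rng)  # frozen vision foundation model
inverse_dynamics = make_linear(2 * FEAT_DIM, LATENT_ACTION_DIM, rng)  # used in training only
forward_decoder = make_linear(FEAT_DIM + LATENT_ACTION_DIM, FEAT_DIM, rng)  # reused as LaWM
latent_action_prior = make_linear(FEAT_DIM, LATENT_ACTION_DIM, rng)  # assumed inference-time proposer
policy_head = make_linear(2 * FEAT_DIM, ACTION_DIM * CHUNK_LEN, rng)


def training_loss(obs_t: Vec, obs_future: Vec) -> float:
    """Latent action model objective: reconstruct future features, not pixels."""
    f_t, f_future = encode(obs_t), encode(obs_future)
    z = inverse_dynamics(f_t + f_future)  # list '+' concatenates the two features
    predicted = forward_decoder(f_t + z)
    return sum((p - q) ** 2 for p, q in zip(predicted, f_future)) / FEAT_DIM


def act(obs_t: Vec) -> List[Vec]:
    """Predict a latent visual subgoal, then condition the action chunk on it."""
    f_t = encode(obs_t)
    z = latent_action_prior(f_t)
    subgoal = forward_decoder(f_t + z)  # predicted future feature, never decoded to video
    flat = policy_head(f_t + subgoal)
    return [flat[i * ACTION_DIM:(i + 1) * ACTION_DIM] for i in range(CHUNK_LEN)]
```

Calling `act([0.5] * OBS_DIM)` returns `CHUNK_LEN` action vectors of length `ACTION_DIM`. The point of the sketch is that the subgoal stays a compact feature vector throughout, so no future frames are ever rendered, which is the source of the latency savings the abstract reports over pixel-space WAMs.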