LaWAM: Latent World Action Models for Efficient Dynamics-Aware Robot Policies

Abstract

Vision-Language-Action models (VLAs) leverage large-scale vision-language pretraining for semantic robot control, but often lack explicit foresight into how robot actions change the scene. World-Action Models (WAMs) address this limitation by conditioning policies on predicted futures, yet existing approaches typically rely on computationally expensive video generation with substantial pixel-level redundancy. We present LaWAM, a Latent World Action Model that exposes predictive dynamics to robot policies through compact latent visual subgoals instead of reconstructed future video. At the core of LaWAM is a latent-action-conditioned Latent World Model (LaWM). We obtain LaWM by training a latent action model in the latent space of a pretrained vision foundation model and repurposing its forward decoder to predict future observation features for scene evolution. LaWAM then conditions action generation on these predicted latent visual subgoals to enable dynamics-aware robot control. LaWAM achieves state-of-the-art or competitive success rates (SRs) across LIBERO (98.6% SR), RoboTwin (91.22% SR), and real-world manipulation tasks while retaining low-latency inference. LaWAM runs in 187 ms per action-chunk prediction and achieves up to 24x lower wall-clock latency than pixel-space WAMs.
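
The data flow described in the abstract can be sketched with a toy example. This is only an illustration under loud assumptions, not the authors' architecture: random linear maps stand in for every network, the language instruction is omitted, and all names and sizes (`encode`, `infer_latent_action`, `forward_decode`, `propose_latent_action`, `policy`, the dimensions) are hypothetical. In particular, the abstract does not say how a latent action is obtained at inference time, when the future frame is unavailable, so `propose_latent_action` is a placeholder for whatever component fills that role.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, all hypothetical: raw observation, visual latent, latent action,
# robot action dimension, and number of actions per predicted chunk.
D_OBS, D_LAT, D_LA, D_ROBOT, CHUNK = 64, 16, 4, 7, 8


def linear(n_in, n_out):
    """Random linear map standing in for a trained network (assumption)."""
    return rng.normal(size=(n_in, n_out)) / np.sqrt(n_in)


W_enc = linear(D_OBS, D_LAT)            # frozen vision foundation model (stand-in)
W_inv = linear(2 * D_LAT, D_LA)         # latent action model: inverse encoder
W_fwd = linear(D_LAT + D_LA, D_LAT)     # latent action model: forward decoder
W_prior = linear(D_LAT, D_LA)           # hypothetical inference-time proposer
W_pol = linear(2 * D_LAT, CHUNK * D_ROBOT)  # action head (stand-in)


def encode(obs):
    """Map an observation into the latent space of the frozen encoder."""
    return obs @ W_enc


def infer_latent_action(z_now, z_future):
    """Training only: summarize the change between two latent frames."""
    return np.tanh(np.concatenate([z_now, z_future]) @ W_inv)


def forward_decode(z_now, latent_action):
    """Predict future observation features from current features and a latent action."""
    return z_now + np.concatenate([z_now, latent_action]) @ W_fwd


def propose_latent_action(z_now):
    """Hypothetical: pick a latent action without seeing the future frame."""
    return np.tanh(z_now @ W_prior)


def policy(z_now, z_subgoal):
    """Produce an action chunk conditioned on the predicted latent subgoal."""
    return (np.concatenate([z_now, z_subgoal]) @ W_pol).reshape(CHUNK, D_ROBOT)


# Training-time view: the forward decoder must reproduce the future features
# from the current features plus the inferred latent action.
obs_now, obs_future = rng.normal(size=D_OBS), rng.normal(size=D_OBS)
z_now, z_future = encode(obs_now), encode(obs_future)
latent_action = infer_latent_action(z_now, z_future)
train_loss = float(np.mean((forward_decode(z_now, latent_action) - z_future) ** 2))

# Inference-time view: the repurposed forward decoder yields a latent visual
# subgoal, and the policy conditions on it. No pixels are generated.
subgoal = forward_decode(z_now, propose_latent_action(z_now))
action_chunk = policy(z_now, subgoal)
print(train_loss, subgoal.shape, action_chunk.shape)
```

The point of the sketch is the shape of the computation: the predicted future is a `D_LAT`-dimensional feature vector rather than a decoded video, which is the source of the latency advantage the abstract claims over pixel-space WAMs.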