LaWAM: Latent World Action Models for Efficient Dynamics-Aware Robot Policies

Abstract

Vision-Language-Action models (VLAs) leverage large-scale vision-language pretraining for semantic robot control, but often lack explicit foresight into how robot actions change the scene. World-Action Models (WAMs) address this limitation by conditioning policies on predicted futures, yet existing approaches typically rely on computationally expensive video generation with substantial pixel-level redundancy. We present LaWAM, a Latent World Action Model that exposes predictive dynamics to robot policies through compact latent visual subgoals instead of reconstructed future video. At the core of LaWAM is a latent-action-conditioned Latent World Model (LaWM). We obtain LaWM by training a latent action model in the latent space of a pretrained vision foundation model and repurposing its forward decoder to predict future observation features, which capture how the scene evolves. LaWAM then conditions action generation on these predicted latent visual subgoals to enable dynamics-aware robot control. LaWAM achieves state-of-the-art or competitive success rates (SRs) across LIBERO (98.6% SR), RoboTwin (91.22% SR), and real-world manipulation tasks while retaining low-latency inference. LaWAM runs in 187 ms per action-chunk prediction and achieves up to 24x lower wall-clock latency than pixel-space WAMs.
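The data flow described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: the random linear maps stand in for trained networks, the dimensions are arbitrary, language conditioning is omitted, and the `W_prop` "latent-action proposer" used at inference is hypothetical, since the abstract does not say how latent actions are obtained at test time. It is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; none of these come from the paper.
D_OBS, D_LAT, D_ACT_LAT, D_ROBOT, CHUNK = 64, 16, 4, 7, 8


def linear(d_in, d_out):
    """Random linear map standing in for a trained network."""
    return rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)


W_enc = linear(D_OBS, D_LAT)               # stand-in for the frozen vision foundation model
W_inv = linear(2 * D_LAT, D_ACT_LAT)       # latent action model: infers a latent action
W_fwd = linear(D_LAT + D_ACT_LAT, D_LAT)   # forward decoder, repurposed as the latent world model
W_prop = linear(D_LAT, D_ACT_LAT)          # hypothetical latent-action proposer (assumption)
W_pi = linear(2 * D_LAT, CHUNK * D_ROBOT)  # policy head conditioned on the subgoal


def encode(obs):
    """Map an observation to latent visual features."""
    return np.tanh(obs @ W_enc)


def infer_latent_action(z_now, z_future):
    """Training-time path: latent action explaining the change z_now -> z_future."""
    return np.tanh(np.concatenate([z_now, z_future]) @ W_inv)


def predict_future(z_now, latent_action):
    """Forward decoder: predict future latent features (the latent visual subgoal)."""
    return np.tanh(np.concatenate([z_now, latent_action]) @ W_fwd)


def act(obs):
    """Inference-time path: predict a latent subgoal, then an action chunk conditioned on it."""
    z_now = encode(obs)
    latent_action = np.tanh(z_now @ W_prop)
    subgoal = predict_future(z_now, latent_action)
    chunk = (np.concatenate([z_now, subgoal]) @ W_pi).reshape(CHUNK, D_ROBOT)
    return subgoal, chunk


subgoal, chunk = act(rng.standard_normal(D_OBS))
print(subgoal.shape, chunk.shape)
```

The point of the sketch is that no future image is ever decoded: the policy consumes the predicted latent features directly, which is where the abstract locates the latency savings over pixel-space WAMs.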