LaWAM: A Latent World Action Model for Efficient Dynamics-Aware Robot Policies

Abstract

Vision-Language-Action models (VLAs) leverage large-scale vision-language pretraining for semantic robot control, but often lack explicit foresight into how robot actions change the scene. World-Action Models (WAMs) address this limitation by conditioning policies on predicted futures, yet existing approaches typically rely on computationally expensive video generation with substantial pixel-level redundancy. We present LaWAM, a Latent World Action Model that exposes predictive dynamics to robot policies through compact latent visual subgoals instead of reconstructed future video. At the core of LaWAM is a latent-action-conditioned Latent World Model (LaWM). We obtain LaWM by training a latent action model in the latent space of a pretrained vision foundation model and repurposing its forward decoder to predict future observation features that capture scene evolution. LaWAM then conditions action generation on these predicted latent visual subgoals to enable dynamics-aware robot control. LaWAM achieves state-of-the-art or competitive success rates (SRs) on LIBERO (98.6% SR), RoboTwin (91.22% SR), and real-world manipulation tasks while retaining low-latency inference. LaWAM runs in 187 ms per action-chunk prediction and achieves up to 24x lower wall-clock latency than pixel-space WAMs.
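
To make the data flow described above concrete, the following is a minimal sketch, not the authors' implementation. It assumes random linear maps in place of trained networks and omits the language instruction entirely. All names and sizes (`encode`, `infer_latent_action`, `forward_decoder`, `act`, `FEAT_DIM`, and so on) are hypothetical. In particular, the abstract does not say how a latent action is obtained at inference time, when no future frame is available, so the proposer `W_prop` is purely an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
FEAT_DIM = 32          # feature size of the frozen vision encoder
LATENT_ACT_DIM = 8     # size of one latent action
ACTION_DIM = 7         # e.g. a 7-DoF arm command
CHUNK_LEN = 4          # robot actions per predicted chunk
IMG_PIXELS = 16 * 16 * 3


def linear(in_dim, out_dim):
    """Random weight matrix standing in for a trained network."""
    return rng.normal(scale=in_dim ** -0.5, size=(in_dim, out_dim))


W_enc = linear(IMG_PIXELS, FEAT_DIM)                 # frozen vision model (stand-in)
W_inv = linear(2 * FEAT_DIM, LATENT_ACT_DIM)         # latent action model: inverse encoder
W_fwd = linear(FEAT_DIM + LATENT_ACT_DIM, FEAT_DIM)  # latent action model: forward decoder
W_prop = linear(FEAT_DIM, LATENT_ACT_DIM)            # ASSUMED inference-time proposer
W_pi = linear(2 * FEAT_DIM, CHUNK_LEN * ACTION_DIM)  # policy head


def encode(image):
    """Map an image to features in the (frozen) latent space."""
    return np.tanh(image.reshape(-1) @ W_enc)


def infer_latent_action(feat_now, feat_future):
    """Inverse encoder: summarize the change between two feature vectors."""
    return np.tanh(np.concatenate([feat_now, feat_future]) @ W_inv)


def forward_decoder(feat_now, latent_action):
    """Forward decoder: predict future features from current ones plus a latent action."""
    return np.tanh(np.concatenate([feat_now, latent_action]) @ W_fwd)


def lam_training_loss(image_now, image_future):
    """Feature-space reconstruction loss for the latent action model."""
    f_now, f_future = encode(image_now), encode(image_future)
    z = infer_latent_action(f_now, f_future)
    return float(np.mean((forward_decoder(f_now, z) - f_future) ** 2))


def act(image_now):
    """Predict a latent visual subgoal, then an action chunk conditioned on it."""
    f_now = encode(image_now)
    z = np.tanh(f_now @ W_prop)          # assumed: latent action proposed from current features
    subgoal = forward_decoder(f_now, z)  # reused forward decoder acts as the latent world model
    chunk = np.concatenate([f_now, subgoal]) @ W_pi
    return subgoal, chunk.reshape(CHUNK_LEN, ACTION_DIM)


frame_now = rng.random((16, 16, 3))
frame_future = rng.random((16, 16, 3))
loss = lam_training_loss(frame_now, frame_future)
subgoal, action_chunk = act(frame_now)
print(subgoal.shape, action_chunk.shape)  # (32,) (4, 7)
```

The point of the sketch is that no pixels are generated at any step: the predicted subgoal is a single feature vector, which is the property the abstract credits for the lower latency relative to pixel-space WAMs.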