StaMo: Unsupervised Learning of Generalizable Robot Motion from Compact State Representation

Abstract

A fundamental challenge in embodied intelligence is developing state representations that are both expressive and compact, enabling efficient world modeling and decision making. Existing methods often fail to strike this balance, yielding representations that are either overly redundant or missing task-critical information. We propose an unsupervised approach that learns a highly compressed two-token state representation using a lightweight encoder and a pre-trained Diffusion Transformer (DiT) decoder, capitalizing on the decoder's strong generative prior. Our representation is efficient, interpretable, and integrates seamlessly into existing VLA-based models, improving performance by 14.3% on LIBERO and by 30% in real-world task success with minimal inference overhead. More importantly, we find that the difference between these tokens, obtained via latent interpolation, naturally serves as a highly effective latent action that can be further decoded into executable robot actions. This emergent capability shows that our representation captures structured dynamics without explicit supervision. We name our method StaMo for its ability to learn generalizable robotic Motion from compact State representations encoded from static images, challenging the prevailing reliance of latent action learning on complex architectures and video data. The resulting latent actions also enhance policy co-training, outperforming prior methods by 10.4% with improved interpretability. Moreover, our approach scales effectively across diverse data sources, including real-world robot data, simulation, and human egocentric video.
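
As a rough illustration of the idea that a token difference can act as a latent action, the sketch below treats each state as two small vectors and computes a difference and a linear interpolation between two states. The abstract does not specify the exact operations, token dimensions, or encoder interface, so plain subtraction, linear interpolation, the names `latent_action` and `interpolate`, and the toy values are all assumptions made for illustration only.

```python
from typing import List

Token = List[float]   # one latent token (toy dimension, hypothetical)
State = List[Token]   # a compact state: two tokens per image


def latent_action(z_now: State, z_goal: State) -> State:
    """Element-wise difference z_goal - z_now, treated here as a latent action."""
    return [[g - n for n, g in zip(tn, tg)] for tn, tg in zip(z_now, z_goal)]


def interpolate(z_now: State, z_goal: State, alpha: float) -> State:
    """Linear interpolation z_now + alpha * (z_goal - z_now), an assumed form."""
    delta = latent_action(z_now, z_goal)
    return [[n + alpha * d for n, d in zip(tn, td)] for tn, td in zip(z_now, delta)]


# Toy stand-ins for encoder outputs; real tokens would come from a learned encoder.
z_t = [[0.0, 1.0, 2.0], [1.0, 1.0, 1.0]]
z_next = [[1.0, 1.0, 4.0], [1.0, 3.0, 0.0]]

action = latent_action(z_t, z_next)        # difference between the two states
midpoint = interpolate(z_t, z_next, 0.5)   # state halfway between them
```

In this toy setting, `interpolate` with `alpha=0.0` returns the current state and with `alpha=1.0` returns the goal state, so intermediate values of `alpha` trace a straight path between the two; whether the learned latent space behaves this way is an empirical claim of the paper, not something the sketch demonstrates.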
