villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models

Abstract

Visual-Language-Action (VLA) models have emerged as a popular paradigm for learning robot manipulation policies that can follow language instructions and generalize to novel scenarios. Recent work has begun to explore incorporating latent actions, an abstract representation of the visual change between two frames, into VLA pre-training. In this paper, we introduce villa-X, a novel Visual-Language-Latent-Action (ViLLA) framework that advances latent action modeling for learning generalizable robot manipulation policies. Our approach improves both how latent actions are learned and how they are incorporated into VLA pre-training. Together, these contributions enable villa-X to achieve superior performance in simulated environments, including SIMPLER and LIBERO, and on two real-world robot setups, one with a gripper and one with a dexterous hand. We believe the ViLLA paradigm holds significant promise and that villa-X provides a strong foundation for future research.
