Unified Vision-Language-Action Model

Abstract

Vision-language-action models (VLAs) have garnered significant attention for their potential to advance robotic manipulation. However, previous approaches rely predominantly on the general comprehension capabilities of vision-language models (VLMs) to generate action signals, often overlooking the rich temporal and causal structure embedded in visual observations. In this paper, we present UniVLA, a unified and native multimodal VLA model that autoregressively models vision, language, and action signals as discrete token sequences. This formulation enables flexible multimodal task learning, particularly from large-scale video data. By incorporating world modeling during post-training, UniVLA captures causal dynamics from videos, facilitating effective transfer to downstream policy learning, especially for long-horizon tasks. Our approach sets new state-of-the-art results across several widely used simulation benchmarks, including CALVIN, LIBERO, and Simplenv-Bridge, significantly surpassing previous methods. For example, UniVLA achieves a 95.5% average success rate on the LIBERO benchmark, surpassing pi0-FAST's 85.5%. We further demonstrate its broad applicability on real-world ALOHA manipulation and autonomous driving.
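To make the idea of modeling vision, language, and action as one discrete token sequence more concrete, the following is a minimal sketch, not the UniVLA implementation. The vocabulary sizes, the offset scheme, the instruction-then-(frame, action) ordering, and all function names are assumptions chosen for illustration; the abstract does not specify any of them.

```python
from typing import List, Sequence, Tuple

# Hypothetical vocabulary sizes (assumptions, not taken from the paper).
TEXT_VOCAB = 1000
VISION_VOCAB = 512
ACTION_VOCAB = 256

# Each modality is mapped to a disjoint id range inside one shared vocabulary.
VISION_OFFSET = TEXT_VOCAB
ACTION_OFFSET = TEXT_VOCAB + VISION_VOCAB
TOTAL_VOCAB = ACTION_OFFSET + ACTION_VOCAB


def build_sequence(
    text: Sequence[int],
    frames: Sequence[Sequence[int]],
    actions: Sequence[Sequence[int]],
) -> List[int]:
    """Flatten an instruction followed by (frame_t, action_t) pairs into one token list."""
    if len(frames) != len(actions):
        raise ValueError("expected one action chunk per frame")
    seq = list(text)
    for frame, action in zip(frames, actions):
        seq.extend(VISION_OFFSET + t for t in frame)   # vision tokens for step t
        seq.extend(ACTION_OFFSET + t for t in action)  # action tokens for step t
    return seq


def modality_of(token_id: int) -> str:
    """Recover which modality a shared-vocabulary id belongs to."""
    if not 0 <= token_id < TOTAL_VOCAB:
        raise ValueError("token id outside the shared vocabulary")
    if token_id < VISION_OFFSET:
        return "text"
    if token_id < ACTION_OFFSET:
        return "vision"
    return "action"


def next_token_pairs(seq: Sequence[int]) -> List[Tuple[List[int], int]]:
    """Build (context, target) pairs for autoregressive next-token prediction."""
    return [(list(seq[:i]), seq[i]) for i in range(1, len(seq))]


if __name__ == "__main__":
    seq = build_sequence(text=[1, 2], frames=[[0, 1], [2, 3]], actions=[[5], [6]])
    print(seq)
    print([modality_of(t) for t in seq])
    print(len(next_token_pairs(seq)))
```

Under this assumed layout, a single next-token objective covers every modality: predicting a later frame's vision tokens from the preceding context resembles the world-modeling signal the abstract mentions, while predicting action tokens corresponds to policy learning.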
