A Survey on Vision-Language-Action Models: An Action Tokenization Perspective

Abstract

The remarkable advancements of vision and language foundation models in multimodal understanding, reasoning, and generation have sparked growing efforts to extend such intelligence to the physical world, fueling the flourishing of vision-language-action (VLA) models. Despite seemingly diverse approaches, we observe that current VLA models can be unified under a single framework: vision and language inputs are processed by a series of VLA modules, producing a chain of action tokens that progressively encode more grounded and actionable information, ultimately generating executable actions. We further determine that the primary design choice distinguishing VLA models lies in how action tokens are formulated, which can be categorized into language description, code, affordance, trajectory, goal state, latent representation, raw action, and reasoning. However, a comprehensive understanding of action tokens is still lacking, which significantly impedes effective VLA development and obscures future directions. Therefore, this survey aims to categorize and interpret existing VLA research through the lens of action tokenization, distill the strengths and limitations of each token type, and identify areas for improvement. Through this systematic review and analysis, we offer a synthesized outlook on the broader evolution of VLA models, highlight underexplored yet promising directions, and contribute guidance for future research, hoping to bring the field closer to general-purpose intelligence.
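The unified framework described above can be illustrated with a minimal sketch. Only the eight token categories come from the abstract. The names `ActionToken`, `run_chain`, and the three toy modules are hypothetical and introduced here purely for illustration, not taken from the survey or from any VLA library.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, List


class ActionTokenType(Enum):
    """The eight action-token categories named in the abstract."""
    LANGUAGE_DESCRIPTION = "language description"
    CODE = "code"
    AFFORDANCE = "affordance"
    TRAJECTORY = "trajectory"
    GOAL_STATE = "goal state"
    LATENT_REPRESENTATION = "latent representation"
    RAW_ACTION = "raw action"
    REASONING = "reasoning"


@dataclass
class ActionToken:
    kind: ActionTokenType
    payload: Any  # assumption: the payload format depends on the token type


# Hypothetical module signature: (vision, instruction, tokens so far) -> next token.
VLAModule = Callable[[Any, str, List[ActionToken]], ActionToken]


def run_chain(vision: Any, instruction: str, modules: List[VLAModule]) -> List[ActionToken]:
    """Pass the inputs through each module in turn, collecting the token chain."""
    tokens: List[ActionToken] = []
    for module in modules:
        tokens.append(module(vision, instruction, tokens))
    return tokens


# Toy stand-ins for learned modules; real VLA modules would be neural networks.
def toy_planner(vision, instruction, tokens):
    return ActionToken(ActionTokenType.LANGUAGE_DESCRIPTION, f"subtask for: {instruction}")


def toy_trajectory(vision, instruction, tokens):
    return ActionToken(ActionTokenType.TRAJECTORY, [(0.0, 0.0), (0.1, 0.2)])


def toy_controller(vision, instruction, tokens):
    # Turn the last waypoint of the preceding trajectory token into a raw action.
    waypoints = tokens[-1].payload
    return ActionToken(ActionTokenType.RAW_ACTION, list(waypoints[-1]))


chain = run_chain(None, "pick up the cup", [toy_planner, toy_trajectory, toy_controller])
for token in chain:
    print(token.kind.value, "->", token.payload)
```

In this sketch each module sees the tokens produced before it, so the chain moves from an abstract language description toward an executable raw action, mirroring the progression the abstract describes.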
