APT: Improving Instruction Generalization of Vision-Language-Action Policies via Action Expert Pretraining

Abstract

Vision-Language-Action (VLA) models that couple pretrained Vision-Language Models (VLMs) with continuous action experts have achieved strong manipulation performance, yet their generalization to out-of-distribution (OOD) language instructions remains poor. A known challenge is the structural imbalance in VLA data: language is far less diverse than visual and action content, which makes policies prone to visual shortcuts. Discrete-action methods mitigate this through vision-language co-training, but continuous action experts lack such protection. They start from random initialization and learn entirely from imbalanced data, producing noisy gradients that corrupt the VLM and fail to exploit its language capability. We address this from a Bayesian perspective, factorizing the policy into a language-agnostic Vision-Action (VA) prior and a language-conditioned VLA likelihood, and propose APT, a two-stage training method emphasizing Action expert PreTraining. In Stage 1, the action expert is pretrained as a VA prior on vision-action pairs from a frozen VLM, bypassing the language imbalance. In Stage 2, language tokens are injected through a gated fusion mechanism that integrates VLM features while preserving the learned visuomotor prior. APT applies to mainstream VLA architectures, including π-style and GR00T-style architectures. Comprehensive experiments validate that APT achieves consistent gains on unseen instructions and compositional tasks. Project Page: https://xukechun.github.io/papers/APT/
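
The abstract does not give the exact form of the gated fusion mechanism, so the following is only an illustrative sketch of why a gate can preserve a pretrained visuomotor prior. It assumes a scalar tanh gate initialized at zero, a choice made here for illustration and not confirmed by the source. The function name `gated_fusion`, the parameter `gate_logit`, and the toy feature vectors are all hypothetical. With the gate closed, the fused output equals the Stage 1 prior features exactly, and language features are mixed in only as the gate opens during Stage 2.

```python
import math


def gated_fusion(prior_features, language_features, gate_logit):
    """Blend language-conditioned features into visuomotor-prior features.

    Assumed (hypothetical) form: out = prior + tanh(gate_logit) * language.
    With gate_logit = 0.0 the gate is closed and the prior passes through unchanged.
    """
    if len(prior_features) != len(language_features):
        raise ValueError("feature vectors must have the same length")
    gate = math.tanh(gate_logit)
    return [p + gate * l for p, l in zip(prior_features, language_features)]


# Toy stand-ins for real features (not taken from the paper).
prior = [0.5, -1.0, 2.0]      # Stage 1 action-expert (vision-action) features
language = [1.0, 1.0, -1.0]   # VLM language-token features

closed = gated_fusion(prior, language, gate_logit=0.0)  # gate closed
opened = gated_fusion(prior, language, gate_logit=1.0)  # gate partly open
```

In this sketch, `closed` reproduces `prior`, which is the property the abstract attributes to Stage 2: language is integrated without overwriting what the action expert learned in Stage 1. In a real model the gate would be a learned parameter, possibly per channel or per layer, applied to tensors instead of Python lists.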