From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models

Abstract

One promise that Vision-Language-Action (VLA) models hold over traditional imitation learning for robotics is to leverage the broad generalization capabilities of large Vision-Language Models (VLMs) to produce versatile, "generalist" robot policies. However, current evaluations of VLAs remain insufficient. Traditional imitation learning benchmarks are unsuitable because they lack language instructions. Emerging VLA benchmarks that incorporate language often include only a limited set of evaluation tasks and are not designed to investigate how much VLM pretraining truly contributes to the generalization capabilities of the downstream robotic policy. Meanwhile, much research relies on real-world robot setups designed in isolation by different institutions, which creates a barrier to reproducibility and accessibility. To address this gap, we introduce a unified probing suite of 50 simulation-based tasks across 10 subcategories spanning language instruction, vision, and objects. We systematically evaluate several state-of-the-art VLA architectures on this suite to understand their generalization capability. Our results show that while VLM backbones endow VLAs with robust perceptual understanding and high-level planning, which we refer to as "good intentions", this does not reliably translate into precise motor execution: when faced with out-of-distribution observations, policies often exhibit coherent intentions but falter in action execution. Moreover, finetuning on action data can erode the original VLM's generalist reasoning abilities. We release our task suite and evaluation code to serve as a standardized benchmark for future VLAs and to drive research on closing the perception-to-action gap. More information, including the source code, can be found at https://ai4ce.github.io/INT-ACT/.
