InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation

Abstract

To operate effectively in the real world, robots must integrate multimodal reasoning with precise action generation. However, existing vision-language-action (VLA) models often sacrifice one for the other, narrow their abilities to task-specific manipulation data, and suffer catastrophic forgetting of pre-trained vision-language capabilities. To bridge this gap, we introduce InstructVLA, an end-to-end VLA model that preserves the flexible reasoning of large vision-language models (VLMs) while delivering leading manipulation performance. InstructVLA introduces a novel training paradigm, Vision-Language-Action Instruction Tuning (VLA-IT), which employs multimodal training with mixture-of-experts adaptation to jointly optimize textual reasoning and action generation on both standard VLM corpora and a curated 650K-sample VLA-IT dataset. On in-domain SimplerEnv tasks, InstructVLA achieves a 30.5% improvement over SpatialVLA. To evaluate generalization, we introduce SimplerEnv-Instruct, an 80-task benchmark requiring closed-loop control and high-level instruction understanding, on which InstructVLA outperforms a fine-tuned OpenVLA by 92% and an action expert aided by GPT-4o by 29%. Additionally, InstructVLA surpasses baseline VLMs on multimodal tasks and exhibits inference-time scaling by leveraging textual reasoning to boost manipulation performance in both simulated and real-world settings. These results demonstrate InstructVLA's potential for bridging intuitive and steerable human-robot interaction with efficient policy learning.
