Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

Abstract

Embodied intelligence is often studied through specialized models for individual tasks such as manipulation or navigation, resulting in fragmented capabilities and limited generalization across tasks, environments, and robot embodiments. In this work, we study whether heterogeneous embodied decision-making problems can be unified within a single vision-language-action model. We present Qwen-VLA, a unified embodied foundation model that extends Qwen's vision-language modeling stack from perception, understanding, and reasoning to continuous action and trajectory generation through a DiT-based action decoder. Qwen-VLA is trained with a large-scale joint pretraining recipe over diverse data sources, including robot manipulation trajectories, human egocentric demonstrations, synthetic simulation data, vision-and-language navigation data, trajectory-centric supervision, and auxiliary vision-language data. To support multiple robot platforms, we introduce embodiment-aware prompt conditioning, in which robot-specific textual descriptions specify the current embodiment and control convention. We further cast manipulation, navigation, and trajectory prediction into a unified action-and-trajectory prediction framework, enabling transferable visual grounding, spatial reasoning, and continuous action generation across robot morphologies, task families, and environments. Experiments on manipulation, navigation, and trajectory-centric benchmarks show consistent multi-task performance and out-of-distribution generalization under variations in scene layout, background, lighting, object configuration, and robot embodiment. Qwen-VLA-Instruct achieves 97.9% on LIBERO, 73.7% on Simpler-WidowX, 86.1%/87.2% on RoboTwin-Easy/Hard, 69.0% OSR on R2R, 59.6% SR on RxR, 76.9% average OOD success in real-world ALOHA experiments, and 26.6% zero-shot success on DOMINO dynamic manipulation.
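To make the idea of embodiment-aware prompt conditioning concrete, the following is a minimal sketch of how a robot-specific textual description might be prepended to a task instruction. It is not the prompt format used by Qwen-VLA, which the abstract does not specify. The names EmbodimentSpec, build_prompt, and pad_action, the prompt wording, and the example action dimension are all hypothetical. Zero-padding actions to a shared width with a validity mask is one common way to place robots with different action spaces behind a single decoder, and is shown here only as an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class EmbodimentSpec:
    """Hypothetical description of a robot platform (not from the paper)."""
    name: str
    action_dim: int
    control_convention: str


def build_prompt(spec: EmbodimentSpec, instruction: str) -> str:
    """Prepend a textual embodiment description to the task instruction."""
    return (
        f"Embodiment: {spec.name}. "
        f"Control: {spec.control_convention} ({spec.action_dim}-D action). "
        f"Task: {instruction}"
    )


def pad_action(action: List[float], max_dim: int) -> Tuple[List[float], List[int]]:
    """Zero-pad an action to a shared width and return a validity mask.

    Assumption for illustration only: the mask marks real dimensions with 1
    and padded dimensions with 0.
    """
    if len(action) > max_dim:
        raise ValueError("action is wider than the shared action space")
    pad = max_dim - len(action)
    return action + [0.0] * pad, [1] * len(action) + [0] * pad


if __name__ == "__main__":
    # Illustrative values; the 7-D action layout is an assumption.
    arm = EmbodimentSpec("WidowX", 7, "end-effector delta pose plus gripper")
    print(build_prompt(arm, "put the spoon on the towel"))
    print(pad_action([0.1] * arm.action_dim, max_dim=14))
```

Under this sketch, the same model input pipeline could serve different platforms by changing only the EmbodimentSpec, while the mask tells a downstream loss or decoder which action dimensions are meaningful for the current robot.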