Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments
May 28, 2026
Authors: Qiuyue Wang, Mingsheng Li, Jian Guan, Jinhui Ye, Sicheng Xie, Yitao Liu, Junhao Chen, Zhixuan Liang, Jie Zhang, Xintong Hu, Xuhong Huang, Pei Lin, Junyang Lin, Dayiheng Liu, Shuai Bai, Jingren Zhou, Jiazhao Zhang, Haoqi Yuan, Gengze Zhou, Hang Yin, Ye Wang, Yiyang Huang, Zixing Lei, Wujian Peng, Delin Chen, Yingming Zheng, Jingyang Fan, Xianwei Zhuang, Xin Zhou, Haoyang Li, Anzhe Chen, Tong Zhang, Xuejing Liu, Yuchong Sun, Ruizhe Chen, Zhaohai Li, Chenxu Lü, Zhibo Yang, Tao Yu, Xionghui Chen
cs.AI
Abstract
Embodied intelligence is often studied through specialized models for individual tasks such as manipulation or navigation, resulting in fragmented capabilities and limited generalization across tasks, environments, and robot embodiments. In this work, we study whether heterogeneous embodied decision-making problems can be unified within a single vision-language-action model. We present Qwen-VLA, a unified embodied foundation model that extends Qwen's vision-language modeling stack from perception, understanding, and reasoning to continuous action and trajectory generation through a DiT-based action decoder. Qwen-VLA is trained with a large-scale joint pretraining recipe over diverse data sources, including robotics manipulation trajectories, human egocentric demonstrations, synthetic simulation data, vision-and-language navigation data, trajectory-centric supervision, and auxiliary vision-language data. To support multiple robot platforms, we introduce embodiment-aware prompt conditioning, where robot-specific textual descriptions specify the current embodiment and control convention. We further cast manipulation, navigation, and trajectory prediction into a unified action-and-trajectory prediction framework, enabling transferable visual grounding, spatial reasoning, and continuous action generation across robot morphologies, task families, and environments. Experiments on manipulation, navigation, and trajectory-centric benchmarks show consistent multi-task performance and out-of-distribution generalization under variations in scene layout, background, lighting, object configuration, and robot embodiment. Qwen-VLA-Instruct achieves 97.9% on LIBERO, 73.7% on Simpler-WidowX, 86.1%/87.2% on RoboTwin-Easy/Hard, 69.0% OSR on R2R, 59.6% SR on RxR, 76.9% average OOD success in real-world ALOHA experiments, and 26.6% zero-shot success on DOMINO dynamic manipulation.
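The embodiment-aware prompt conditioning described in the abstract can be illustrated with a minimal sketch. The abstract states only that robot-specific textual descriptions specify the current embodiment and control convention; the prompt layout, the `EmbodimentSpec` class, the `build_prompt` helper, and the example field values below are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbodimentSpec:
    """Hypothetical description of a robot platform (illustrative fields only)."""
    name: str
    action_dim: int
    control_convention: str


def build_prompt(spec: EmbodimentSpec, instruction: str) -> str:
    """Prefix a task instruction with a textual embodiment description.

    The layout is an assumption made for illustration; the actual prompt
    format used by Qwen-VLA is not given in the abstract.
    """
    header = (
        f"Embodiment: {spec.name}. "
        f"Action dimension: {spec.action_dim}. "
        f"Control convention: {spec.control_convention}."
    )
    return f"{header}\nTask: {instruction.strip()}"


# Example values are assumptions chosen to show two different platforms.
widowx = EmbodimentSpec("WidowX single-arm", 7, "end-effector delta pose plus gripper")
aloha = EmbodimentSpec("ALOHA bimanual", 14, "absolute joint positions")

# The same instruction yields different conditioning text per robot.
prompt_a = build_prompt(widowx, "put the spoon on the towel")
prompt_b = build_prompt(aloha, "put the spoon on the towel")
print(prompt_a)
print(prompt_b)
```

In this sketch the task text is identical across robots and only the header changes, which is one way a single model could be told which action space and control convention to produce.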