Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments
May 28, 2026
Authors: Qiuyue Wang, Mingsheng Li, Jian Guan, Jinhui Ye, Sicheng Xie, Yitao Liu, Junhao Chen, Zhixuan Liang, Jie Zhang, Xintong Hu, Xuhong Huang, Pei Lin, Junyang Lin, Dayiheng Liu, Shuai Bai, Jingren Zhou, Jiazhao Zhang, Haoqi Yuan, Gengze Zhou, Hang Yin, Ye Wang, Yiyang Huang, Zixing Lei, Wujian Peng, Delin Chen, Yingming Zheng, Jingyang Fan, Xianwei Zhuang, Xin Zhou, Haoyang Li, Anzhe Chen, Tong Zhang, Xuejing Liu, Yuchong Sun, Ruizhe Chen, Zhaohai Li, Chenxu Lü, Zhibo Yang, Tao Yu, Xionghui Chen
cs.AI
Abstract
Embodied intelligence is often studied through specialized models for individual tasks such as manipulation or navigation, resulting in fragmented capabilities and limited generalization across tasks, environments, and robot embodiments. In this work, we study whether heterogeneous embodied decision-making problems can be unified within a single vision-language-action model. We present Qwen-VLA, a unified embodied foundation model that extends Qwen's vision-language modeling stack from perception, understanding, and reasoning to continuous action and trajectory generation through a DiT-based action decoder. Qwen-VLA is trained with a large-scale joint pretraining recipe over diverse data sources, including robotics manipulation trajectories, human egocentric demonstrations, synthetic simulation data, vision-and-language navigation data, trajectory-centric supervision, and auxiliary vision-language data. To support multiple robot platforms, we introduce embodiment-aware prompt conditioning, where robot-specific textual descriptions specify the current embodiment and control convention. We further cast manipulation, navigation, and trajectory prediction into a unified action-and-trajectory prediction framework, enabling transferable visual grounding, spatial reasoning, and continuous action generation across robot morphologies, task families, and environments. Experiments on manipulation, navigation, and trajectory-centric benchmarks show consistent multi-task performance and out-of-distribution generalization under variations in scene layout, background, lighting, object configuration, and robot embodiment. Qwen-VLA-Instruct achieves 97.9% on LIBERO, 73.7% on Simpler-WidowX, 86.1%/87.2% on RoboTwin-Easy/Hard, 69.0% OSR on R2R, 59.6% SR on RxR, 76.9% average OOD success in real-world ALOHA experiments, and 26.6% zero-shot success on DOMINO dynamic manipulation.
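The embodiment-aware prompt conditioning mentioned above can be pictured with a small sketch. The abstract does not give the actual prompt format, so everything here is an assumption made for illustration: the `Embodiment` class, the `build_prompt` helper, the prompt wording, and the example action dimensions are all hypothetical. The only idea taken from the abstract is that a robot-specific textual description of the embodiment and its control convention accompanies the task instruction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Embodiment:
    """Hypothetical description of one robot platform (not from the paper)."""
    name: str
    action_dim: int
    control_convention: str


def build_prompt(embodiment: Embodiment, instruction: str) -> str:
    """Prefix the task instruction with a textual embodiment description.

    The layout of this string is an assumed format, chosen only to show
    how the same instruction can be conditioned on different robots.
    """
    header = (
        f"Robot: {embodiment.name}. "
        f"Action space: {embodiment.action_dim}-D, "
        f"{embodiment.control_convention}."
    )
    return f"{header}\nTask: {instruction}"


# Illustrative values only; the paper's real descriptions may differ.
widowx = Embodiment("WidowX", 7, "end-effector delta pose plus gripper")
aloha = Embodiment("ALOHA", 14, "bimanual joint positions")

instruction = "put the spoon on the towel"
prompt_a = build_prompt(widowx, instruction)
prompt_b = build_prompt(aloha, instruction)
```

In this sketch the two prompts share the same task line and differ only in the embodiment header, which is the piece a single model would read to pick the appropriate control convention.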