InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation
July 23, 2025
作者: Shuai Yang, Hao Li, Yilun Chen, Bin Wang, Yang Tian, Tai Wang, Hanqing Wang, Feng Zhao, Yiyi Liao, Jiangmiao Pang
cs.AI
Abstract
To operate effectively in the real world, robots must integrate multimodal
reasoning with precise action generation. However, existing
vision-language-action (VLA) models often sacrifice one for the other, narrow
their abilities to task-specific manipulation data, and suffer catastrophic
forgetting of pre-trained vision-language capabilities. To bridge this gap, we
introduce InstructVLA, an end-to-end VLA model that preserves the flexible
reasoning of large vision-language models (VLMs) while delivering leading
manipulation performance. InstructVLA introduces a novel training paradigm,
Vision-Language-Action Instruction Tuning (VLA-IT), which employs multimodal
training with mixture-of-experts adaptation to jointly optimize textual
reasoning and action generation on both standard VLM corpora and a curated
650K-sample VLA-IT dataset. On in-domain SimplerEnv tasks, InstructVLA achieves
a 30.5% improvement over SpatialVLA. To evaluate generalization, we introduce
SimplerEnv-Instruct, an 80-task benchmark requiring closed-loop control and
high-level instruction understanding, where InstructVLA outperforms a fine-tuned OpenVLA
by 92% and an action expert aided by GPT-4o by 29%. Additionally, InstructVLA
surpasses baseline VLMs on multimodal tasks and exhibits inference-time scaling
by leveraging textual reasoning to boost manipulation performance in both
simulated and real-world settings. These results demonstrate InstructVLA's
potential for bridging intuitive and steerable human-robot interaction with
efficient policy learning.
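
The abstract describes mixture-of-experts adaptation only at a high level. As a rough illustration (not the authors' implementation), the sketch below shows one common way such adaptation is realized: a frozen pretrained projection augmented with several low-rank expert adapters whose outputs are mixed by a per-token gate. The class name MoELoRALinear, the expert count, and the rank are assumptions made for this example.

```python
import torch
import torch.nn as nn

class MoELoRALinear(nn.Module):
    """Hypothetical MoE adapter: frozen base projection plus gated low-rank experts."""

    def __init__(self, in_dim: int, out_dim: int, num_experts: int = 4, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)          # stands in for a pretrained VLM projection
        for p in self.base.parameters():                # keep pretrained weights frozen
            p.requires_grad_(False)
        self.down = nn.ModuleList([nn.Linear(in_dim, rank, bias=False) for _ in range(num_experts)])
        self.up = nn.ModuleList([nn.Linear(rank, out_dim, bias=False) for _ in range(num_experts)])
        self.gate = nn.Linear(in_dim, num_experts)      # per-token routing over experts

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, in_dim)
        weights = torch.softmax(self.gate(x), dim=-1)                        # (B, S, E)
        expert_out = torch.stack(
            [up(down(x)) for down, up in zip(self.down, self.up)], dim=-1
        )                                                                    # (B, S, out_dim, E)
        mixed = (expert_out * weights.unsqueeze(-2)).sum(dim=-1)             # (B, S, out_dim)
        return self.base(x) + mixed

# Toy usage: wrap one projection with the adapter; language-modeling and action
# losses would then back-propagate only through the experts and the gate.
layer = MoELoRALinear(in_dim=1024, out_dim=1024)
tokens = torch.randn(2, 16, 1024)
out = layer(tokens)    # (2, 16, 1024)
```

Under assumptions like these, the shared backbone stays intact while gradients from both text-reasoning and action-generation objectives flow only through the lightweight experts and the gate, which is one way joint optimization can proceed without erasing pretrained vision-language capabilities.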