InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation
July 23, 2025
Authors: Shuai Yang, Hao Li, Yilun Chen, Bin Wang, Yang Tian, Tai Wang, Hanqing Wang, Feng Zhao, Yiyi Liao, Jiangmiao Pang
cs.AI
Abstract
To operate effectively in the real world, robots must integrate multimodal reasoning with precise action generation. However, existing vision-language-action (VLA) models often sacrifice one for the other, narrow their abilities to task-specific manipulation data, and suffer catastrophic forgetting of pre-trained vision-language capabilities. To bridge this gap, we introduce InstructVLA, an end-to-end VLA model that preserves the flexible reasoning of large vision-language models (VLMs) while delivering leading manipulation performance. InstructVLA introduces a novel training paradigm, Vision-Language-Action Instruction Tuning (VLA-IT), which employs multimodal training with mixture-of-experts adaptation to jointly optimize textual reasoning and action generation on both standard VLM corpora and a curated 650K-sample VLA-IT dataset. On in-domain SimplerEnv tasks, InstructVLA achieves a 30.5% improvement over SpatialVLA. To evaluate generalization, we introduce SimplerEnv-Instruct, an 80-task benchmark requiring closed-loop control and high-level instruction understanding, where InstructVLA outperforms a fine-tuned OpenVLA by 92% and an action expert aided by GPT-4o by 29%. Additionally, InstructVLA surpasses baseline VLMs on multimodal tasks and exhibits inference-time scaling by leveraging textual reasoning to boost manipulation performance in both simulated and real-world settings. These results demonstrate InstructVLA's potential for bridging intuitive and steerable human-robot interaction with efficient policy learning.
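
To make the VLA-IT idea concrete, the sketch below illustrates, in plain PyTorch, one way the abstract's description could be realized: a VLM backbone produces multimodal reasoning features and language-model logits, a small mixture-of-experts adapter conditions an action head on those features, and a language-modeling loss and an action loss are optimized jointly. This is a minimal, hypothetical sketch based only on the abstract; the module names (MoEAdapter, VLAITModel, ToyVLM), loss weighting, and action parameterization are illustrative assumptions, not the paper's actual architecture or code.

```python
# Hypothetical sketch of VLA-IT joint training (not the authors' implementation).
# Assumes a VLM that maps tokens to (hidden states, LM logits); all names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEAdapter(nn.Module):
    """Tiny mixture-of-experts adapter over VLM hidden states (assumed design)."""

    def __init__(self, dim: int, num_experts: int = 4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])

    def forward(self, h: torch.Tensor) -> torch.Tensor:        # h: [B, T, D]
        weights = F.softmax(self.gate(h), dim=-1)               # [B, T, E]
        outs = torch.stack([e(h) for e in self.experts], dim=-1)  # [B, T, D, E]
        return (outs * weights.unsqueeze(2)).sum(dim=-1)        # weighted expert mixture


class ToyVLM(nn.Module):
    """Stand-in for a pretrained VLM; returns hidden states and LM logits."""

    def __init__(self, vocab: int = 1000, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, tokens: torch.Tensor):
        hidden = self.embed(tokens)
        return hidden, self.lm_head(hidden)


class VLAITModel(nn.Module):
    """Joint text-reasoning + action-generation head on top of a VLM (sketch)."""

    def __init__(self, vlm: nn.Module, dim: int, action_dim: int = 7):
        super().__init__()
        self.vlm = vlm
        self.adapter = MoEAdapter(dim)
        self.action_head = nn.Linear(dim, action_dim)

    def forward(self, tokens: torch.Tensor):
        hidden, lm_logits = self.vlm(tokens)          # multimodal reasoning features
        pooled = self.adapter(hidden).mean(dim=1)     # condition actions on reasoning features
        actions = self.action_head(pooled)            # continuous low-level action
        return lm_logits, actions


def vla_it_loss(lm_logits, text_targets, actions, action_targets, alpha: float = 1.0):
    """Jointly optimize language modeling and action prediction, per the VLA-IT idea."""
    lm = F.cross_entropy(lm_logits.flatten(0, 1), text_targets.flatten())
    act = F.mse_loss(actions, action_targets)
    return lm + alpha * act


if __name__ == "__main__":
    model = VLAITModel(ToyVLM(), dim=64)
    tokens = torch.randint(0, 1000, (2, 16))          # dummy multimodal token ids
    lm_logits, actions = model(tokens)
    loss = vla_it_loss(lm_logits, tokens, actions, torch.zeros(2, 7))
    loss.backward()
```

One design point this toy example is meant to highlight: because the language-modeling term is kept in the joint objective alongside the action term, the VLM backbone continues to be supervised as a language model during manipulation training, which is consistent with the abstract's goal of avoiding catastrophic forgetting of pre-trained vision-language capabilities.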