InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation

July 23, 2025
Authors: Shuai Yang, Hao Li, Yilun Chen, Bin Wang, Yang Tian, Tai Wang, Hanqing Wang, Feng Zhao, Yiyi Liao, Jiangmiao Pang
cs.AI

Abstract

To operate effectively in the real world, robots must integrate multimodal reasoning with precise action generation. However, existing vision-language-action (VLA) models often sacrifice one for the other, narrow their abilities to task-specific manipulation data, and suffer catastrophic forgetting of pre-trained vision-language capabilities. To bridge this gap, we introduce InstructVLA, an end-to-end VLA model that preserves the flexible reasoning of large vision-language models (VLMs) while delivering leading manipulation performance. InstructVLA introduces a novel training paradigm, Vision-Language-Action Instruction Tuning (VLA-IT), which employs multimodal training with mixture-of-experts adaptation to jointly optimize textual reasoning and action generation on both standard VLM corpora and a curated 650K-sample VLA-IT dataset. On in-domain SimplerEnv tasks, InstructVLA achieves a 30.5% improvement over SpatialVLA. To evaluate generalization, we introduce SimplerEnv-Instruct, an 80-task benchmark requiring closed-loop control and high-level instruction understanding, where InstructVLA outperforms a fine-tuned OpenVLA by 92% and an action expert aided by GPT-4o by 29%. Additionally, InstructVLA surpasses baseline VLMs on multimodal tasks and exhibits inference-time scaling by leveraging textual reasoning to boost manipulation performance in both simulated and real-world settings. These results demonstrate InstructVLA's potential for bridging intuitive and steerable human-robot interaction with efficient policy learning.
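
To make the high-level description concrete, the sketch below illustrates, in generic PyTorch, one way a model of this kind could be wired: a shared VLM-style backbone whose token features feed both a text head (reasoning) and a mixture-of-experts adapter followed by an action head (manipulation). Everything here is an illustrative assumption; the module names (MoEAdapter, ToyVLA), dimensions, and soft top-k-free routing are not taken from the paper or its codebase.

```python
# Toy sketch (not the authors' code): a shared backbone with a text head for
# reasoning and an MoE-adapted action head for manipulation. All names, sizes,
# and the routing scheme are illustrative assumptions.
import torch
import torch.nn as nn


class MoEAdapter(nn.Module):
    """Tiny mixture-of-experts adapter over per-token features (soft routing)."""

    def __init__(self, dim: int, num_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); soft routing keeps the sketch simple and differentiable
        weights = self.router(x).softmax(dim=-1)                          # (B, T, E)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)    # (B, T, D, E)
        return (expert_out * weights.unsqueeze(2)).sum(dim=-1)            # (B, T, D)


class ToyVLA(nn.Module):
    """Shared backbone feeding a text head and an MoE-adapted action head."""

    def __init__(self, dim: int = 256, vocab: int = 1000, action_dim: int = 7):
        super().__init__()
        self.vision_proj = nn.Linear(768, dim)        # stand-in for a vision encoder output
        self.text_embed = nn.Embedding(vocab, dim)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2
        )
        self.text_head = nn.Linear(dim, vocab)        # language / reasoning logits
        self.adapter = MoEAdapter(dim)
        self.action_head = nn.Linear(dim, action_dim)  # e.g. 6-DoF pose + gripper

    def forward(self, image_feats: torch.Tensor, instruction_ids: torch.Tensor):
        tokens = torch.cat(
            [self.vision_proj(image_feats), self.text_embed(instruction_ids)], dim=1
        )
        h = self.backbone(tokens)
        text_logits = self.text_head(h)                           # text supervision
        action = self.action_head(self.adapter(h).mean(dim=1))    # pooled action output
        return text_logits, action


if __name__ == "__main__":
    model = ToyVLA()
    img = torch.randn(2, 16, 768)             # 16 visual patch features per sample
    ids = torch.randint(0, 1000, (2, 12))     # tokenized instruction
    logits, action = model(img, ids)
    print(logits.shape, action.shape)         # torch.Size([2, 28, 1000]) torch.Size([2, 7])
```

The point of the dual-head layout is that both outputs share the same backbone features, so text supervision (standard VLM corpora plus instruction-style data) and action supervision can be optimized jointly rather than one overwriting the other.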