InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation
July 23, 2025
Authors: Shuai Yang, Hao Li, Yilun Chen, Bin Wang, Yang Tian, Tai Wang, Hanqing Wang, Feng Zhao, Yiyi Liao, Jiangmiao Pang
cs.AI
Abstract
To operate effectively in the real world, robots must integrate multimodal reasoning with precise action generation. However, existing vision-language-action (VLA) models often sacrifice one for the other, narrow their abilities to task-specific manipulation data, and suffer catastrophic forgetting of pre-trained vision-language capabilities. To bridge this gap, we introduce InstructVLA, an end-to-end VLA model that preserves the flexible reasoning of large vision-language models (VLMs) while delivering leading manipulation performance. InstructVLA introduces a novel training paradigm, Vision-Language-Action Instruction Tuning (VLA-IT), which employs multimodal training with mixture-of-experts adaptation to jointly optimize textual reasoning and action generation on both standard VLM corpora and a curated 650K-sample VLA-IT dataset. On in-domain SimplerEnv tasks, InstructVLA achieves a 30.5% improvement over SpatialVLA. To evaluate generalization, we introduce SimplerEnv-Instruct, an 80-task benchmark requiring closed-loop control and high-level instruction understanding, where InstructVLA outperforms a fine-tuned OpenVLA by 92% and an action expert aided by GPT-4o by 29%. Additionally, InstructVLA surpasses baseline VLMs on multimodal tasks and exhibits inference-time scaling by leveraging textual reasoning to boost manipulation performance in both simulated and real-world settings. These results demonstrate InstructVLA's potential for bridging intuitive and steerable human-robot interaction with efficient policy learning.
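
To make the VLA-IT idea concrete, the sketch below illustrates, in plain PyTorch, one way the abstract's description could be realized: a VLM backbone produces multimodal reasoning features and language-model logits, a small mixture-of-experts adapter conditions an action head on those features, and a language-modeling loss and an action loss are optimized jointly. This is a minimal, hypothetical sketch based only on the abstract; the module names (MoEAdapter, VLAITModel, ToyVLM), loss weighting, and action parameterization are illustrative assumptions, not the paper's actual architecture or code.

```python
# Hypothetical sketch of VLA-IT joint training (not the authors' implementation).
# Assumes a VLM that maps tokens to (hidden states, LM logits); all names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEAdapter(nn.Module):
    """Tiny mixture-of-experts adapter over VLM hidden states (assumed design)."""

    def __init__(self, dim: int, num_experts: int = 4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])

    def forward(self, h: torch.Tensor) -> torch.Tensor:        # h: [B, T, D]
        weights = F.softmax(self.gate(h), dim=-1)               # [B, T, E]
        outs = torch.stack([e(h) for e in self.experts], dim=-1)  # [B, T, D, E]
        return (outs * weights.unsqueeze(2)).sum(dim=-1)        # weighted expert mixture


class ToyVLM(nn.Module):
    """Stand-in for a pretrained VLM; returns hidden states and LM logits."""

    def __init__(self, vocab: int = 1000, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, tokens: torch.Tensor):
        hidden = self.embed(tokens)
        return hidden, self.lm_head(hidden)


class VLAITModel(nn.Module):
    """Joint text-reasoning + action-generation head on top of a VLM (sketch)."""

    def __init__(self, vlm: nn.Module, dim: int, action_dim: int = 7):
        super().__init__()
        self.vlm = vlm
        self.adapter = MoEAdapter(dim)
        self.action_head = nn.Linear(dim, action_dim)

    def forward(self, tokens: torch.Tensor):
        hidden, lm_logits = self.vlm(tokens)          # multimodal reasoning features
        pooled = self.adapter(hidden).mean(dim=1)     # condition actions on reasoning features
        actions = self.action_head(pooled)            # continuous low-level action
        return lm_logits, actions


def vla_it_loss(lm_logits, text_targets, actions, action_targets, alpha: float = 1.0):
    """Jointly optimize language modeling and action prediction, per the VLA-IT idea."""
    lm = F.cross_entropy(lm_logits.flatten(0, 1), text_targets.flatten())
    act = F.mse_loss(actions, action_targets)
    return lm + alpha * act


if __name__ == "__main__":
    model = VLAITModel(ToyVLM(), dim=64)
    tokens = torch.randint(0, 1000, (2, 16))          # dummy multimodal token ids
    lm_logits, actions = model(tokens)
    loss = vla_it_loss(lm_logits, tokens, actions, torch.zeros(2, 7))
    loss.backward()
```

One design point this toy example is meant to highlight: because the language-modeling term is kept in the joint objective alongside the action term, the VLM backbone continues to be supervised as a language model during manipulation training, which is consistent with the abstract's goal of avoiding catastrophic forgetting of pre-trained vision-language capabilities.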