InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation

Abstract

To operate effectively in the real world, robots must integrate multimodal reasoning with precise action generation. However, existing vision-language-action (VLA) models often sacrifice one for the other, restrict their abilities to task-specific manipulation data, and suffer catastrophic forgetting of pre-trained vision-language capabilities. To bridge this gap, we introduce InstructVLA, an end-to-end VLA model that preserves the flexible reasoning of large vision-language models (VLMs) while delivering leading manipulation performance. InstructVLA introduces a novel training paradigm, Vision-Language-Action Instruction Tuning (VLA-IT), which employs multimodal training with mixture-of-experts adaptation to jointly optimize textual reasoning and action generation on both standard VLM corpora and a curated 650K-sample VLA-IT dataset. On in-domain SimplerEnv tasks, InstructVLA achieves a 30.5% improvement over SpatialVLA. To evaluate generalization, we introduce SimplerEnv-Instruct, an 80-task benchmark requiring closed-loop control and high-level instruction understanding, on which InstructVLA outperforms a fine-tuned OpenVLA by 92% and an action expert aided by GPT-4o by 29%. Additionally, InstructVLA surpasses baseline VLMs on multimodal tasks and exhibits inference-time scaling by leveraging textual reasoning to boost manipulation performance in both simulated and real-world settings. These results demonstrate InstructVLA's potential for bridging intuitive and steerable human-robot interaction with efficient policy learning.
