OpenVLA: An Open-Source Vision-Language-Action Model

Abstract

Large policies pretrained on a combination of Internet-scale vision-language data and diverse robot demonstrations have the potential to change how we teach robots new skills: rather than training new behaviors from scratch, we can fine-tune such vision-language-action (VLA) models to obtain robust, generalizable policies for visuomotor control. Yet, widespread adoption of VLAs for robotics has been challenging as 1) existing VLAs are largely closed and inaccessible to the public, and 2) prior work fails to explore methods for efficiently fine-tuning VLAs for new tasks, a key component for adoption. Addressing these challenges, we introduce OpenVLA, a 7B-parameter open-source VLA trained on a diverse collection of 970k real-world robot demonstrations. OpenVLA builds on a Llama 2 language model combined with a visual encoder that fuses pretrained features from DINOv2 and SigLIP. As a product of the added data diversity and new model components, OpenVLA demonstrates strong results for generalist manipulation, outperforming closed models such as RT-2-X (55B) by 16.5% in absolute task success rate across 29 tasks and multiple robot embodiments, with 7x fewer parameters. We further show that we can effectively fine-tune OpenVLA for new settings, with especially strong generalization results in multi-task environments involving multiple objects and strong language grounding abilities, and outperform expressive from-scratch imitation learning methods such as Diffusion Policy by 20.4%. We also explore compute efficiency; as a separate contribution, we show that OpenVLA can be fine-tuned on consumer GPUs via modern low-rank adaptation methods and served efficiently via quantization without a hit to downstream success rate. Finally, we release model checkpoints, fine-tuning notebooks, and our PyTorch codebase with built-in support for training VLAs at scale on Open X-Embodiment datasets.
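
The abstract mentions fine-tuning through low-rank adaptation. As a rough illustration of that general idea, and not of the OpenVLA codebase itself, the sketch below freezes a weight matrix `W` and trains only two small matrices `A` and `B` whose product forms a low-rank update. All function names, matrix sizes, and the rank are hypothetical and chosen only for the example.

```python
# Illustrative sketch of low-rank adaptation (LoRA); not the OpenVLA codebase.
# All names and sizes below are hypothetical.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha=1.0):
    """Compute W x + (alpha / r) * B (A x), where r is the adapter rank.

    W: frozen weight, shape (d_out, d_in)
    A: trainable down-projection, shape (r, d_in)
    B: trainable up-projection, shape (d_out, r)
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


def trainable_fraction(d_out, d_in, r):
    """Fraction of parameters trained by a rank-r adapter vs. the full matrix."""
    return r * (d_in + d_out) / (d_out * d_in)


# With B initialised to zeros, the adapted layer matches the frozen layer.
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, -0.5]]          # rank r = 1
B_zero = [[0.0], [0.0]]
x = [1.0, 1.0]
assert lora_forward(W, A, B_zero, x) == matvec(W, x)

# A hypothetical 4096 x 4096 projection with a rank-32 adapter.
print(f"{trainable_fraction(4096, 4096, 32):.4%}")
```

Under these assumed sizes the adapter trains about 1.56% of the parameters in that single matrix, which suggests why this family of methods can fit on consumer GPUs. The figure is specific to the toy dimensions here and says nothing about the fraction used in the paper's experiments.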
