OpenVLA: An Open-Source Vision-Language-Action Model
June 13, 2024
Authors: Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, Chelsea Finn
cs.AI
Abstract
Large policies pretrained on a combination of Internet-scale vision-language
data and diverse robot demonstrations have the potential to change how we teach
robots new skills: rather than training new behaviors from scratch, we can
fine-tune such vision-language-action (VLA) models to obtain robust,
generalizable policies for visuomotor control. Yet, widespread adoption of VLAs
for robotics has been challenging as 1) existing VLAs are largely closed and
inaccessible to the public, and 2) prior work fails to explore methods for
efficiently fine-tuning VLAs for new tasks, a key component for adoption.
Addressing these challenges, we introduce OpenVLA, a 7B-parameter open-source
VLA trained on a diverse collection of 970k real-world robot demonstrations.
OpenVLA builds on a Llama 2 language model combined with a visual encoder that
fuses pretrained features from DINOv2 and SigLIP. As a product of the added
data diversity and new model components, OpenVLA demonstrates strong results
for generalist manipulation, outperforming closed models such as RT-2-X (55B)
by 16.5% in absolute task success rate across 29 tasks and multiple robot
embodiments, with 7x fewer parameters. We further show that we can effectively
fine-tune OpenVLA for new settings, with especially strong generalization
results in multi-task environments involving multiple objects and strong
language grounding abilities, and outperform expressive from-scratch imitation
learning methods such as Diffusion Policy by 20.4%. We also explore compute
efficiency; as a separate contribution, we show that OpenVLA can be fine-tuned
on consumer GPUs via modern low-rank adaptation methods and served efficiently
via quantization without a hit to downstream success rate. Finally, we release
model checkpoints, fine-tuning notebooks, and our PyTorch codebase with
built-in support for training VLAs at scale on Open X-Embodiment datasets.
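
Since the abstract points to released model checkpoints and a PyTorch codebase, the following is a minimal inference sketch, not the authors' exact API. It assumes the public checkpoint is hosted on the Hugging Face Hub as openvla/openvla-7b and exposes a predict_action helper via trust_remote_code; the prompt template, unnorm_key, and image path below are illustrative placeholders, so consult the released codebase for the precise usage.

```python
# Hypothetical inference sketch for the released OpenVLA checkpoint.
# Assumptions: Hub id "openvla/openvla-7b", a predict_action() helper loaded via
# trust_remote_code, and an illustrative prompt template / unnorm_key.
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda:0")

image = Image.open("observation.png")  # current camera observation (placeholder path)
prompt = "In: What action should the robot take to pick up the remote?\nOut:"

inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
# `action` is a low-dimensional end-effector command the robot controller can execute.
```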
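The claims about fine-tuning on consumer GPUs via low-rank adaptation and serving via quantization can be made concrete with a short sketch using the generic peft and bitsandbytes APIs. This is an illustration of the general technique, not the paper's recipe: the rank, dropout, and target-module settings are placeholder values.

```python
# Illustrative sketch of quantized loading plus low-rank adaptation (LoRA),
# using generic Hugging Face peft / bitsandbytes APIs. Hyperparameters are placeholders.
import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit weight quantization for memory-efficient loading/serving.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",          # assumed Hub id for the released checkpoint
    quantization_config=bnb_config,
    trust_remote_code=True,
)

# Low-rank adaptation: only small adapter matrices are trained, so the 7B model
# fits in consumer-GPU memory during fine-tuning.
lora_config = LoraConfig(
    r=32,                          # adapter rank (placeholder value)
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules="all-linear",   # attach adapters to all linear layers
)
vla = get_peft_model(vla, lora_config)
vla.print_trainable_parameters()   # reports the small fraction of trainable weights
```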