OpenVLA: An Open-Source Vision-Language-Action Model
June 13, 2024
Authors: Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, Chelsea Finn
cs.AI
Abstract
Large policies pretrained on a combination of Internet-scale vision-language
data and diverse robot demonstrations have the potential to change how we teach
robots new skills: rather than training new behaviors from scratch, we can
fine-tune such vision-language-action (VLA) models to obtain robust,
generalizable policies for visuomotor control. Yet, widespread adoption of VLAs
for robotics has been challenging as 1) existing VLAs are largely closed and
inaccessible to the public, and 2) prior work fails to explore methods for
efficiently fine-tuning VLAs for new tasks, a key component for adoption.
Addressing these challenges, we introduce OpenVLA, a 7B-parameter open-source
VLA trained on a diverse collection of 970k real-world robot demonstrations.
OpenVLA builds on a Llama 2 language model combined with a visual encoder that
fuses pretrained features from DINOv2 and SigLIP. As a product of the added
data diversity and new model components, OpenVLA demonstrates strong results
for generalist manipulation, outperforming closed models such as RT-2-X (55B)
by 16.5% in absolute task success rate across 29 tasks and multiple robot
embodiments, with 7x fewer parameters. We further show that we can effectively
fine-tune OpenVLA for new settings, with especially strong generalization
results in multi-task environments involving multiple objects and strong
language grounding abilities, and outperform expressive from-scratch imitation
learning methods such as Diffusion Policy by 20.4%. We also explore compute
efficiency; as a separate contribution, we show that OpenVLA can be fine-tuned
on consumer GPUs via modern low-rank adaptation methods and served efficiently
via quantization without a hit to downstream success rate. Finally, we release
model checkpoints, fine-tuning notebooks, and our PyTorch codebase with
built-in support for training VLAs at scale on Open X-Embodiment datasets.
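
Since the abstract points to released model checkpoints and a PyTorch codebase, the following is a minimal inference sketch, not the authors' exact API. It assumes the public checkpoint is hosted on the Hugging Face Hub as openvla/openvla-7b and exposes a predict_action helper via trust_remote_code; the prompt template, unnorm_key, and image path below are illustrative placeholders, so consult the released codebase for the precise usage.

```python
# Hypothetical inference sketch for the released OpenVLA checkpoint.
# Assumptions: Hub id "openvla/openvla-7b", a predict_action() helper loaded via
# trust_remote_code, and an illustrative prompt template / unnorm_key.
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda:0")

image = Image.open("observation.png")  # current camera observation (placeholder path)
prompt = "In: What action should the robot take to pick up the remote?\nOut:"

inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
# `action` is a low-dimensional end-effector command the robot controller can execute.
```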
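The claims about fine-tuning on consumer GPUs via low-rank adaptation and serving via quantization can be made concrete with a short sketch using the generic peft and bitsandbytes APIs. This is an illustration of the general technique, not the paper's recipe: the rank, dropout, and target-module settings are placeholder values.

```python
# Illustrative sketch of quantized loading plus low-rank adaptation (LoRA),
# using generic Hugging Face peft / bitsandbytes APIs. Hyperparameters are placeholders.
import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit weight quantization for memory-efficient loading/serving.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",          # assumed Hub id for the released checkpoint
    quantization_config=bnb_config,
    trust_remote_code=True,
)

# Low-rank adaptation: only small adapter matrices are trained, so the 7B model
# fits in consumer-GPU memory during fine-tuning.
lora_config = LoraConfig(
    r=32,                          # adapter rank (placeholder value)
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules="all-linear",   # attach adapters to all linear layers
)
vla = get_peft_model(vla, lora_config)
vla.print_trainable_parameters()   # reports the small fraction of trainable weights
```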