OpenVLA: An Open-Source Vision-Language-Action Model

Abstract

Large policies pretrained on a combination of Internet-scale vision-language data and diverse robot demonstrations have the potential to change how we teach robots new skills: rather than training new behaviors from scratch, we can fine-tune such vision-language-action (VLA) models to obtain robust, generalizable policies for visuomotor control. Yet, widespread adoption of VLAs for robotics has been challenging as 1) existing VLAs are largely closed and inaccessible to the public, and 2) prior work fails to explore methods for efficiently fine-tuning VLAs for new tasks, a key component for adoption. Addressing these challenges, we introduce OpenVLA, a 7B-parameter open-source VLA trained on a diverse collection of 970k real-world robot demonstrations. OpenVLA builds on a Llama 2 language model combined with a visual encoder that fuses pretrained features from DINOv2 and SigLIP. As a product of the added data diversity and new model components, OpenVLA demonstrates strong results for generalist manipulation, outperforming closed models such as RT-2-X (55B) by 16.5% in absolute task success rate across 29 tasks and multiple robot embodiments, with 7x fewer parameters. We further show that we can effectively fine-tune OpenVLA for new settings, with especially strong generalization results in multi-task environments involving multiple objects and strong language grounding abilities, and outperform expressive from-scratch imitation learning methods such as Diffusion Policy by 20.4%. We also explore compute efficiency; as a separate contribution, we show that OpenVLA can be fine-tuned on consumer GPUs via modern low-rank adaptation methods and served efficiently via quantization without a hit to downstream success rate. Finally, we release model checkpoints, fine-tuning notebooks, and our PyTorch codebase with built-in support for training VLAs at scale on Open X-Embodiment datasets.
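The abstract mentions fine-tuning through low-rank adaptation without giving details. As a rough illustration only, the sketch below shows the generic low-rank adaptation idea: a frozen weight matrix W is augmented with a trainable product B·A of rank r, so far fewer parameters are updated. The matrix sizes, the rank, the scaling factor alpha, and all function names are assumptions chosen for illustration, not values or code from OpenVLA.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the adapter rank."""
    r = len(a)  # A has shape (r, d_in); B has shape (d_out, r)
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def trainable_params(d_out, d_in, r):
    """Trainable parameter counts: (full fine-tuning, rank-r adapter)."""
    return d_out * d_in, r * (d_in + d_out)


# Hypothetical toy sizes, chosen only to keep the example small.
d_out, d_in, r, alpha = 6, 4, 2, 4.0
rng = random.Random(0)
w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen weights
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable
b = [[0.0] * r for _ in range(d_out)]                               # trainable, zero-initialized

# With B initialized to zero, the adapted layer starts out identical to the frozen one.
w_eff = lora_effective_weight(w, a, b, alpha)

# Parameter counts for one assumed 4096 x 4096 projection with an assumed rank of 32.
full, lora = trainable_params(4096, 4096, 32)
print(full, lora, full // lora)
```

Under these assumed sizes, the adapter trains 64 times fewer parameters for that single matrix than full fine-tuning would, which is the kind of saving that makes fine-tuning on a consumer GPU plausible.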
