BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models

Abstract

Recently, leveraging pre-trained vision-language models (VLMs) to build vision-language-action (VLA) models has emerged as a promising approach to effective robot manipulation learning. However, only a few methods incorporate 3D signals into VLMs for action prediction, and they do not fully leverage the spatial structure inherent in 3D data, leading to low sample efficiency. In this paper, we introduce BridgeVLA, a novel 3D VLA model that (1) projects 3D inputs to multiple 2D images, ensuring input alignment with the VLM backbone, and (2) utilizes 2D heatmaps for action prediction, unifying the input and output spaces within a consistent 2D image space. In addition, we propose a scalable pre-training method that equips the VLM backbone with the capability to predict 2D heatmaps before downstream policy learning. Extensive experiments show that the proposed method learns 3D manipulation efficiently and effectively. BridgeVLA outperforms state-of-the-art baseline methods across three simulation benchmarks. In RLBench, it improves the average success rate from 81.4% to 88.2%. In COLOSSEUM, it demonstrates significantly better performance in challenging generalization settings, boosting the average success rate from 56.7% to 64.0%. In GemBench, it surpasses all compared baseline methods in terms of average success rate. In real-robot experiments, BridgeVLA outperforms a state-of-the-art baseline method by 32% on average. It generalizes robustly in multiple out-of-distribution settings, including visual disturbances and unseen instructions. Remarkably, it achieves a success rate of 96.8% on 10+ tasks with only 3 trajectories per task, highlighting its extraordinary sample efficiency. Project website: https://bridgevla.github.io/
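
The two alignment ideas named in the abstract, projecting 3D input to several 2D images and predicting actions as 2D heatmaps, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes three axis-aligned orthographic views rendered as binary occupancy images, and it recovers a 3D position by summing the three heatmap scores over a voxel grid. The function names (`to_grid`, `project_views`, `heatmaps_to_point`), the view set, and the scoring rule are all hypothetical choices made for illustration.

```python
import numpy as np

def to_grid(points, lo, hi, res):
    """Map metric xyz coordinates to integer grid indices in [0, res - 1]."""
    scaled = (points - lo) / (hi - lo) * res
    return np.clip(np.floor(scaled).astype(int), 0, res - 1)

def project_views(points, lo, hi, res):
    """Render three axis-aligned occupancy images from a point cloud.

    Assumed convention: "top" keeps (x, y), "front" keeps (x, z),
    "side" keeps (y, z).
    """
    idx = to_grid(points, lo, hi, res)
    axes = {"top": (0, 1), "front": (0, 2), "side": (1, 2)}
    views = {}
    for name, (a, b) in axes.items():
        img = np.zeros((res, res), dtype=np.float32)
        img[idx[:, a], idx[:, b]] = 1.0  # mark occupied pixels
        views[name] = img
    return views

def heatmaps_to_point(top, front, side, lo, hi):
    """Return the center of the 3D cell whose three 2D projections
    have the highest summed heatmap score (an illustrative rule)."""
    res = top.shape[0]
    # score[i, j, k] = top[i, j] + front[i, k] + side[j, k]
    score = top[:, :, None] + front[:, None, :] + side[None, :, :]
    i, j, k = np.unravel_index(np.argmax(score), score.shape)
    cell = (hi - lo) / res
    return lo + (np.array([i, j, k]) + 0.5) * cell
```

In a pipeline of this shape, the images from `project_views` would stand in for the 2D inputs given to the VLM backbone, and the model's per-view heatmaps would be passed to `heatmaps_to_point` to obtain a 3D target position. Because both sides live in the same image grid, a pixel in an input view and a pixel in an output heatmap refer to the same workspace location.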
