

BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models

June 9, 2025
作者: Peiyan Li, Yixiang Chen, Hongtao Wu, Xiao Ma, Xiangnan Wu, Yan Huang, Liang Wang, Tao Kong, Tieniu Tan
cs.AI

Abstract

Recently, leveraging pre-trained vision-language models (VLMs) to build vision-language-action (VLA) models has emerged as a promising approach to effective robot manipulation learning. However, only a few methods incorporate 3D signals into VLMs for action prediction, and they do not fully leverage the spatial structure inherent in 3D data, leading to low sample efficiency. In this paper, we introduce BridgeVLA, a novel 3D VLA model that (1) projects 3D inputs to multiple 2D images, ensuring input alignment with the VLM backbone, and (2) utilizes 2D heatmaps for action prediction, unifying the input and output spaces within a consistent 2D image space. In addition, we propose a scalable pre-training method that equips the VLM backbone with the capability to predict 2D heatmaps before downstream policy learning. Extensive experiments show that the proposed method learns 3D manipulation efficiently and effectively. BridgeVLA outperforms state-of-the-art baseline methods across three simulation benchmarks. In RLBench, it improves the average success rate from 81.4% to 88.2%. In COLOSSEUM, it demonstrates significantly better performance in challenging generalization settings, boosting the average success rate from 56.7% to 64.0%. In GemBench, it surpasses all competing baseline methods in terms of average success rate. In real-robot experiments, BridgeVLA outperforms a state-of-the-art baseline method by 32% on average. It generalizes robustly in multiple out-of-distribution settings, including visual disturbances and unseen instructions. Remarkably, it achieves a success rate of 96.8% on 10+ tasks with only 3 trajectories per task, highlighting its extraordinary sample efficiency. Project Website: https://bridgevla.github.io/
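To make the input-output alignment idea more concrete, the sketch below illustrates the pipeline the abstract describes: a colored point cloud is rendered into several orthographic 2D views, a VLM backbone (not shown) would score each view with a language-conditioned 2D heatmap, and the heatmap peaks are lifted back to a 3D translation target. This is a minimal illustration based only on the abstract; the function names, the choice of three axis-aligned views, and the peak-fusion decoding (`project_to_views`, `heatmap_to_translation`) are assumptions, not the authors' released API.

```python
# Hypothetical sketch of BridgeVLA-style input-output alignment (not the official code).
# 3D point cloud -> orthographic 2D views -> per-view 2D heatmaps -> 3D translation target.
import numpy as np

def project_to_views(points, colors, bounds, im_size=224):
    """Render a colored point cloud into top/front/side orthographic images (assumed views)."""
    lo, hi = bounds                                   # workspace bounds, each of shape (3,)
    norm = (points - lo) / (hi - lo)                  # normalize coordinates to [0, 1]^3
    views = []
    for drop_axis in (2, 1, 0):                       # drop z, y, x -> top, front, side views
        keep = [a for a in range(3) if a != drop_axis]
        uv = np.clip((norm[:, keep] * (im_size - 1)).astype(int), 0, im_size - 1)
        img = np.zeros((im_size, im_size, 3), dtype=np.float32)
        img[uv[:, 1], uv[:, 0]] = colors              # simple splatting; later points overwrite
        views.append(img)
    return views

def heatmap_to_translation(heatmaps, bounds, im_size=224):
    """Fuse per-view heatmap peaks back into a 3D translation (illustrative decoding)."""
    lo, hi = bounds
    coords, counts = np.zeros(3), np.zeros(3)
    for drop_axis, hm in zip((2, 1, 0), heatmaps):
        keep = [a for a in range(3) if a != drop_axis]
        v, u = np.unravel_index(np.argmax(hm), hm.shape)   # peak pixel (row, col)
        for axis, px in zip(keep, (u, v)):                  # col -> first kept axis, row -> second
            coords[axis] += px / (im_size - 1)
            counts[axis] += 1
    return lo + (coords / counts) * (hi - lo)               # each axis is seen in two views

# In the full model, the heatmaps would come from the VLM backbone conditioned on the
# instruction, e.g. heatmaps = backbone(views, instruction); here they are left abstract.
```

Keeping both the observation (projected 2D views) and the action target (2D heatmaps) in the same image space is what the title refers to as input-output alignment; the rotation and gripper components of the action, which the abstract does not detail, are omitted from this sketch.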