BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models
June 9, 2025
作者: Peiyan Li, Yixiang Chen, Hongtao Wu, Xiao Ma, Xiangnan Wu, Yan Huang, Liang Wang, Tao Kong, Tieniu Tan
cs.AI
Abstract
Recently, leveraging pre-trained vision-language models (VLMs) for building
vision-language-action (VLA) models has emerged as a promising approach to
effective robot manipulation learning. However, only a few methods incorporate 3D
signals into VLMs for action prediction, and they do not fully leverage the
spatial structure inherent in 3D data, leading to low sample efficiency. In
this paper, we introduce BridgeVLA, a novel 3D VLA model that (1) projects 3D
inputs to multiple 2D images, ensuring input alignment with the VLM backbone,
and (2) utilizes 2D heatmaps for action prediction, unifying the input and
output spaces within a consistent 2D image space. In addition, we propose a
scalable pre-training method that equips the VLM backbone with the capability
to predict 2D heatmaps before downstream policy learning. Extensive experiments
show the proposed method is able to learn 3D manipulation efficiently and
effectively. BridgeVLA outperforms state-of-the-art baseline methods across
three simulation benchmarks. In RLBench, it improves the average success rate
from 81.4% to 88.2%. In COLOSSEUM, it demonstrates significantly better
performance in challenging generalization settings, boosting the average
success rate from 56.7% to 64.0%. In GemBench, it surpasses all the compared
baseline methods in terms of average success rate. In real-robot experiments,
BridgeVLA outperforms a state-of-the-art baseline method by 32% on average. It
generalizes robustly in multiple out-of-distribution settings, including visual
disturbances and unseen instructions. Remarkably, it is able to achieve a
success rate of 96.8% on 10+ tasks with only 3 trajectories per task,
highlighting its extraordinary sample efficiency.
Project Website: https://bridgevla.github.io/
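
To make the input-output alignment idea concrete, below is a minimal, hedged sketch of the mechanism the abstract describes: a 3D point cloud is projected into multiple 2D views, a backbone predicts one 2D heatmap per view, and the heatmap peaks are back-projected into a 3D translation target. The names `orthographic_views`, `HeatmapPolicy`, and `heatmaps_to_translation`, the three-view choice, the workspace bounds, and the tiny convolutional stand-in for the VLM backbone are all illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of BridgeVLA-style input-output alignment (not the authors' code).
# Assumptions: points live in a cubic workspace, three orthographic views, and a
# placeholder conv net standing in for the pre-trained VLM backbone + heatmap head.
import torch
import torch.nn as nn

IMG = 224  # rendered view resolution (assumption)
VIEW_AXES = [(0, 1), (0, 2), (1, 2)]  # top (xy), front (xz), side (yz) projections


def orthographic_views(points: torch.Tensor, bounds=(-0.5, 0.5)) -> torch.Tensor:
    """Project an (N, 3) point cloud into 3 binary occupancy images of shape (3, IMG, IMG)."""
    lo, hi = bounds
    uv = ((points - lo) / (hi - lo) * (IMG - 1)).long().clamp(0, IMG - 1)
    views = torch.zeros(3, IMG, IMG)
    for k, (a, b) in enumerate(VIEW_AXES):
        views[k, uv[:, a], uv[:, b]] = 1.0
    return views


class HeatmapPolicy(nn.Module):
    """Placeholder for the VLM backbone + heatmap head: 2D views in, per-view 2D heatmaps out."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1),
        )

    def forward(self, views: torch.Tensor) -> torch.Tensor:
        # One heatmap per projected view; a softmax over pixels keeps the output
        # in the same 2D image space as the input.
        logits = self.net(views.unsqueeze(1)).squeeze(1)  # (3, IMG, IMG)
        return logits.flatten(1).softmax(-1).view(3, IMG, IMG)


def heatmaps_to_translation(heatmaps: torch.Tensor, bounds=(-0.5, 0.5)) -> torch.Tensor:
    """Back-project the per-view argmax pixels into a 3D translation target."""
    lo, hi = bounds
    coords, counts = torch.zeros(3), torch.zeros(3)
    for k, (a, b) in enumerate(VIEW_AXES):
        idx = heatmaps[k].flatten().argmax()
        u, v = idx // IMG, idx % IMG
        for axis, pix in ((a, u), (b, v)):
            coords[axis] += lo + pix.float() / (IMG - 1) * (hi - lo)
            counts[axis] += 1
    return coords / counts  # average each axis over the views that observe it


if __name__ == "__main__":
    cloud = torch.rand(2048, 3) - 0.5          # synthetic point cloud in the workspace
    policy = HeatmapPolicy()
    xyz = heatmaps_to_translation(policy(orthographic_views(cloud)))
    print("predicted translation:", xyz)
```

The design point this sketch tries to capture is the one the abstract emphasizes: because both the projected inputs and the heatmap outputs live in the same 2D image space as the VLM's pre-training data, the backbone can be reused with little adaptation, which the paper credits for the method's sample efficiency.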