BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models

Abstract

Recently, leveraging pre-trained vision-language models (VLMs) to build vision-language-action (VLA) models has emerged as a promising approach to effective robot manipulation learning. However, only a few methods incorporate 3D signals into VLMs for action prediction, and they do not fully leverage the spatial structure inherent in 3D data, leading to low sample efficiency. In this paper, we introduce BridgeVLA, a novel 3D VLA model that (1) projects 3D inputs to multiple 2D images, ensuring input alignment with the VLM backbone, and (2) utilizes 2D heatmaps for action prediction, unifying the input and output spaces within a consistent 2D image space. In addition, we propose a scalable pre-training method that equips the VLM backbone with the capability to predict 2D heatmaps before downstream policy learning. Extensive experiments show that the proposed method learns 3D manipulation efficiently and effectively. BridgeVLA outperforms state-of-the-art baseline methods across three simulation benchmarks. In RLBench, it improves the average success rate from 81.4% to 88.2%. In COLOSSEUM, it demonstrates significantly better performance in challenging generalization settings, boosting the average success rate from 56.7% to 64.0%. In GemBench, it surpasses all compared baseline methods in terms of average success rate. In real-robot experiments, BridgeVLA outperforms a state-of-the-art baseline method by 32% on average. It generalizes robustly in multiple out-of-distribution settings, including visual disturbances and unseen instructions. Remarkably, it achieves a success rate of 96.8% on 10+ tasks with only 3 trajectories per task, highlighting its extraordinary sample efficiency. Project website: https://bridgevla.github.io/
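
The input-output alignment described in the abstract (3D inputs rendered as several 2D images, actions read off 2D heatmaps) can be illustrated with a minimal sketch. This is not the BridgeVLA implementation: it assumes orthographic projection onto three axis-aligned views, a cubic workspace, and binary occupancy images, and every function name here is hypothetical. It only shows how one peak per view can be fused back into a single 3D position.

```python
# Illustrative sketch only, not the BridgeVLA implementation.
# Assumed: orthographic axis-aligned views, a cubic workspace [lo, hi]^3,
# and square images of res x res pixels. All names are hypothetical.

# Each view keeps two of the three axes (0 = x, 1 = y, 2 = z).
VIEWS = {"top": (0, 1), "front": (0, 2), "side": (1, 2)}


def to_pixel(value, lo, hi, res):
    """Map a coordinate in [lo, hi] to a pixel index in [0, res - 1]."""
    idx = int((value - lo) / (hi - lo) * res)
    return min(max(idx, 0), res - 1)


def to_coord(idx, lo, hi, res):
    """Map a pixel index to the centre of its cell in workspace units."""
    return lo + (idx + 0.5) * (hi - lo) / res


def project(points, lo, hi, res):
    """Render a point cloud as one binary occupancy image per view."""
    images = {name: [[0] * res for _ in range(res)] for name in VIEWS}
    for p in points:
        for name, (a, b) in VIEWS.items():
            u = to_pixel(p[a], lo, hi, res)
            v = to_pixel(p[b], lo, hi, res)
            images[name][v][u] = 1  # row = second kept axis, column = first
    return images


def argmax_2d(heatmap):
    """Return (u, v), the column and row of the highest-scoring pixel."""
    best_u, best_v, best_val = 0, 0, heatmap[0][0]
    for v, row in enumerate(heatmap):
        for u, val in enumerate(row):
            if val > best_val:
                best_u, best_v, best_val = u, v, val
    return best_u, best_v


def back_project(heatmaps, lo, hi, res):
    """Recover a 3D point from per-view heatmap peaks.

    Every axis is visible in two views, so its two estimates are averaged.
    """
    sums, counts = [0.0, 0.0, 0.0], [0, 0, 0]
    for name, (a, b) in VIEWS.items():
        u, v = argmax_2d(heatmaps[name])
        for axis, idx in ((a, u), (b, v)):
            sums[axis] += to_coord(idx, lo, hi, res)
            counts[axis] += 1
    return tuple(s / c for s, c in zip(sums, counts))


# Usage: a one-point "cloud" rendered to images doubles as an ideal heatmap.
target = (0.3, -0.2, 0.6)
views = project([target], lo=-1.0, hi=1.0, res=64)
estimate = back_project(views, lo=-1.0, hi=1.0, res=64)
```

In this sketch the recovered point lies within half a pixel cell of the target along each axis, so the image resolution bounds the achievable translation accuracy.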
