VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model

Abstract

Vision-Language-Action (VLA) models typically bridge the gap between perceptual and action spaces by pre-training a large-scale Vision-Language Model (VLM) on robotic data. While this approach greatly enhances performance, it also incurs significant training costs. In this paper, we investigate how to effectively bridge vision-language (VL) representations to action (A). We introduce VLA-Adapter, a novel paradigm designed to reduce the reliance of VLA models on large-scale VLMs and extensive pre-training. To this end, we first systematically analyze the effectiveness of various VL conditions and present key findings on which conditions are essential for bridging perception and action spaces. Based on these insights, we propose a lightweight Policy module with Bridge Attention, which autonomously injects the optimal condition into the action space. In this way, our method achieves high performance using only a 0.5B-parameter backbone, without any robotic data pre-training. Extensive experiments on both simulated and real-world robotic benchmarks demonstrate that VLA-Adapter not only achieves state-of-the-art-level performance but also offers the fastest inference speed reported to date. Furthermore, thanks to the proposed advanced bridging paradigm, VLA-Adapter enables training a powerful VLA model in just 8 hours on a single consumer-grade GPU, greatly lowering the barrier to deploying VLA models. Project page: https://vla-adapter.github.io/.

