VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model

September 11, 2025
Authors: Yihao Wang, Pengxiang Ding, Lingxiao Li, Can Cui, Zirui Ge, Xinyang Tong, Wenxuan Song, Han Zhao, Wei Zhao, Pengxu Hou, Siteng Huang, Yifan Tang, Wenhui Wang, Ru Zhang, Jianyi Liu, Donglin Wang
cs.AI

Abstract

Vision-Language-Action (VLA) models typically bridge the gap between perceptual and action spaces by pre-training a large-scale Vision-Language Model (VLM) on robotic data. While this approach greatly enhances performance, it also incurs significant training costs. In this paper, we investigate how to effectively bridge vision-language (VL) representations to action (A). We introduce VLA-Adapter, a novel paradigm designed to reduce the reliance of VLA models on large-scale VLMs and extensive pre-training. To this end, we first systematically analyze the effectiveness of various VL conditions and present key findings on which conditions are essential for bridging perception and action spaces. Based on these insights, we propose a lightweight Policy module with Bridge Attention, which autonomously injects the optimal condition into the action space. In this way, our method achieves high performance using only a 0.5B-parameter backbone, without any robotic data pre-training. Extensive experiments on both simulated and real-world robotic benchmarks demonstrate that VLA-Adapter not only achieves state-of-the-art level performance, but also offers the fastest inference speed reported to date. Furthermore, thanks to the proposed advanced bridging paradigm, VLA-Adapter enables the training of a powerful VLA model in just 8 hours on a single consumer-grade GPU, greatly lowering the barrier to deploying VLA models. Project page: https://vla-adapter.github.io/.
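
The abstract describes a lightweight Policy module with Bridge Attention that injects VL conditions into the action space, but no code accompanies this listing. The sketch below is a hypothetical, minimal illustration of one way such a gated cross-attention bridge could be wired: learnable action queries attend to VL hidden states and a learned gate controls how strongly the condition is injected. The class name `BridgeAttention`, the dimensions (`vl_dim`, `act_dim`), the number of action queries, and the 7-DoF action head are all assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' released code): learnable action
# queries cross-attend to VL features from a small VLM backbone, and a
# zero-initialized scalar gate decides how much of that condition to inject.
import torch
import torch.nn as nn


class BridgeAttention(nn.Module):
    def __init__(self, vl_dim: int = 896, act_dim: int = 256,
                 num_queries: int = 8, num_heads: int = 8):
        super().__init__()
        # Learnable action queries that will be decoded into an action chunk.
        self.action_queries = nn.Parameter(torch.randn(num_queries, act_dim))
        # Project VL tokens (hidden states from the VLM) into the action width.
        self.vl_proj = nn.Linear(vl_dim, act_dim)
        self.cross_attn = nn.MultiheadAttention(act_dim, num_heads,
                                                batch_first=True)
        # Scalar gate, initialized to zero so the condition is injected gradually.
        self.gate = nn.Parameter(torch.zeros(1))
        self.out = nn.Linear(act_dim, 7)  # e.g. a 7-DoF end-effector action

    def forward(self, vl_tokens: torch.Tensor) -> torch.Tensor:
        # vl_tokens: (batch, seq_len, vl_dim) hidden states from a VLM layer.
        b = vl_tokens.size(0)
        q = self.action_queries.unsqueeze(0).expand(b, -1, -1)
        kv = self.vl_proj(vl_tokens)
        attended, _ = self.cross_attn(q, kv, kv)
        fused = q + torch.tanh(self.gate) * attended  # gated injection
        return self.out(fused)  # (batch, num_queries, 7) action chunk


if __name__ == "__main__":
    vl = torch.randn(2, 64, 896)        # dummy VL features for a smoke test
    print(BridgeAttention()(vl).shape)  # torch.Size([2, 8, 7])
```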