

VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model

September 11, 2025
Authors: Yihao Wang, Pengxiang Ding, Lingxiao Li, Can Cui, Zirui Ge, Xinyang Tong, Wenxuan Song, Han Zhao, Wei Zhao, Pengxu Hou, Siteng Huang, Yifan Tang, Wenhui Wang, Ru Zhang, Jianyi Liu, Donglin Wang
cs.AI

Abstract

Vision-Language-Action (VLA) models typically bridge the gap between perceptual and action spaces by pre-training a large-scale Vision-Language Model (VLM) on robotic data. While this approach greatly enhances performance, it also incurs significant training costs. In this paper, we investigate how to effectively bridge vision-language (VL) representations to action (A). We introduce VLA-Adapter, a novel paradigm designed to reduce the reliance of VLA models on large-scale VLMs and extensive pre-training. To this end, we first systematically analyze the effectiveness of various VL conditions and present key findings on which conditions are essential for bridging perception and action spaces. Based on these insights, we propose a lightweight Policy module with Bridge Attention, which autonomously injects the optimal condition into the action space. In this way, our method achieves high performance using only a 0.5B-parameter backbone, without any robotic data pre-training. Extensive experiments on both simulated and real-world robotic benchmarks demonstrate that VLA-Adapter not only achieves state-of-the-art performance, but also offers the fastest inference speed reported to date. Furthermore, thanks to the proposed advanced bridging paradigm, VLA-Adapter enables the training of a powerful VLA model in just 8 hours on a single consumer-grade GPU, greatly lowering the barrier to deploying VLA models. Project page: https://vla-adapter.github.io/.
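
The abstract describes Bridge Attention only at a high level: a lightweight policy head that injects VL conditions from a small (~0.5B) backbone into the action space. Below is a minimal, hypothetical sketch of what such a gated cross-attention policy module might look like in PyTorch; the class name, dimensions, gating scheme, and query design are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' code): learnable action queries
# cross-attend to VL features from the VLM backbone, and a learnable gate
# controls how strongly that condition is injected before action prediction.
import torch
import torch.nn as nn

class BridgeAttentionPolicy(nn.Module):
    def __init__(self, vl_dim=896, act_dim=7, hidden=256, n_queries=8, n_heads=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, hidden) * 0.02)
        self.vl_proj = nn.Linear(vl_dim, hidden)           # project VL tokens into the policy space
        self.cross_attn = nn.MultiheadAttention(hidden, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))           # injection strength, learned from zero
        self.head = nn.Sequential(nn.LayerNorm(hidden), nn.Linear(hidden, act_dim))

    def forward(self, vl_tokens):                          # vl_tokens: (B, T, vl_dim) from the VLM
        B = vl_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)    # (B, n_queries, hidden)
        kv = self.vl_proj(vl_tokens)                        # (B, T, hidden)
        attended, _ = self.cross_attn(q, kv, kv)            # action queries read the VL condition
        q = q + torch.tanh(self.gate) * attended            # gated injection of the condition
        return self.head(q).mean(dim=1)                     # (B, act_dim) predicted action

# Usage with dummy features standing in for hidden states of a small VLM backbone:
policy = BridgeAttentionPolicy()
vl_tokens = torch.randn(2, 64, 896)
print(policy(vl_tokens).shape)                              # torch.Size([2, 7])
```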