BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation
June 9, 2025
Authors: Hongyu Wang, Chuyan Xiong, Ruiping Wang, Xilin Chen
cs.AI
Abstract
Vision-Language-Action (VLA) models have shown impressive capabilities across
a wide range of robotics manipulation tasks. However, their growing model size
poses significant challenges for deployment on resource-constrained robotic
systems. While 1-bit pretraining has proven effective for enhancing the
inference efficiency of large language models with minimal performance loss,
its application to VLA models remains underexplored. In this work, we present
BitVLA, the first 1-bit VLA model for robotics manipulation, in which every
parameter is ternary, i.e., {-1, 0, 1}. To further reduce the memory footprint
of the vision encoder, we propose a distillation-aware training strategy that
compresses the full-precision encoder to 1.58-bit weights. During this process,
a full-precision encoder serves as a teacher model to better align latent
representations. Despite the lack of large-scale robotics pretraining, BitVLA
achieves performance comparable to the state-of-the-art model OpenVLA-OFT with
4-bit post-training quantization on the LIBERO benchmark, while consuming only
29.8% of the memory. These results highlight BitVLA's promise for deployment on
memory-constrained edge devices. We release the code and model weights at
https://github.com/ustcwhy/BitVLA.
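
The abstract states that every parameter of BitVLA is ternary, i.e., {-1, 0, 1} (1.58-bit weights). As a rough illustration only, the sketch below shows absmean ternary quantization in the style of BitNet b1.58 with a straight-through estimator; the function names are hypothetical and the actual BitVLA kernels in the linked repository may differ.

import torch

def absmean_ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    # Per-tensor scale: mean absolute value of the weights (absmean).
    scale = w.abs().mean().clamp(min=eps)
    # Divide by the scale, round to the nearest integer, and clip into {-1, 0, 1}.
    w_ternary = (w / scale).round().clamp(-1, 1)
    return w_ternary, scale

def ste_ternary(w: torch.Tensor) -> torch.Tensor:
    # Straight-through estimator: the forward pass uses the rescaled ternary
    # weights, while the backward pass treats quantization as the identity.
    w_q, scale = absmean_ternary_quantize(w)
    return w + (w_q * scale - w).detach()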
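
The abstract also describes a distillation-aware training strategy in which the frozen full-precision vision encoder serves as a teacher and the 1.58-bit encoder is the student whose latent representations are aligned to it. The following is a minimal sketch under that reading, using a simple MSE feature-matching loss; the loss choice, weighting, and training loop are assumptions rather than the paper's exact recipe.

import torch
import torch.nn.functional as F

def feature_distillation_loss(student_feats: torch.Tensor,
                              teacher_feats: torch.Tensor) -> torch.Tensor:
    # Align the quantized student encoder's latent features with the
    # frozen full-precision teacher's features.
    return F.mse_loss(student_feats, teacher_feats)

def distillation_step(student_encoder, teacher_encoder, images, optimizer):
    # The teacher runs without gradients; only the student is updated.
    with torch.no_grad():
        teacher_feats = teacher_encoder(images)
    student_feats = student_encoder(images)
    loss = feature_distillation_loss(student_feats, teacher_feats)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()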