BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation

Abstract

Vision-Language-Action (VLA) models have shown impressive capabilities across a wide range of robotics manipulation tasks. However, their growing model size poses significant challenges for deployment on resource-constrained robotic systems. While 1-bit pretraining has proven effective for improving the inference efficiency of large language models with minimal performance loss, its application to VLA models remains underexplored. In this work, we present BitVLA, the first 1-bit VLA model for robotics manipulation, in which every parameter is ternary, i.e., {-1, 0, 1}. To further reduce the memory footprint of the vision encoder, we propose a distillation-aware training strategy that compresses the full-precision encoder to 1.58-bit weights. During this process, the full-precision encoder serves as a teacher model to better align latent representations. Despite the lack of large-scale robotics pretraining, BitVLA achieves performance comparable to the state-of-the-art model OpenVLA-OFT with 4-bit post-training quantization on the LIBERO benchmark, while consuming only 29.8% of the memory. These results highlight BitVLA's promise for deployment on memory-constrained edge devices. We release the code and model weights at https://github.com/ustcwhy/BitVLA.
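To illustrate why ternary weights are described as "1.58-bit", the following minimal Python sketch quantizes a small weight vector to {-1, 0, 1}. The abstract does not specify the quantizer, so the absmean scaling used here is an assumption, and the function names `ternarize` and `bits_per_ternary_weight` and the sample weights are hypothetical. It is not the BitVLA implementation.

```python
import math


def ternarize(weights, eps=1e-8):
    """Quantize weights to {-1, 0, 1} with a single per-tensor scale.

    Assumption: the scale is the mean absolute value (absmean) of the
    weights; the abstract does not state which quantizer BitVLA uses.
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round each scaled weight to the nearest integer, then clamp to [-1, 1].
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale


def bits_per_ternary_weight():
    """Information content of one symbol drawn from three possible values."""
    return math.log2(3)


# Hypothetical example weights, chosen only for illustration.
weights = [0.42, -0.07, -0.90, 0.03, 0.55, -0.31]
ternary, scale = ternarize(weights)

print(ternary)                              # ternary codes in {-1, 0, 1}
print(round(bits_per_ternary_weight(), 2))  # about 1.58 bits per weight
```

Each ternary weight carries log2(3) ≈ 1.58 bits of information, which is where the "1.58-bit" figure comes from. Multiplying the ternary codes by `scale` gives an approximate reconstruction of the original weights.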
