BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation

Abstract

Vision-Language-Action (VLA) models have shown impressive capabilities across a wide range of robotics manipulation tasks. However, their growing model size poses significant challenges for deployment on resource-constrained robotic systems. While 1-bit pretraining has proven effective for improving the inference efficiency of large language models with minimal performance loss, its application to VLA models remains underexplored. In this work, we present BitVLA, the first 1-bit VLA model for robotics manipulation, in which every parameter is ternary, i.e., {-1, 0, 1}. To further reduce the memory footprint of the vision encoder, we propose a distillation-aware training strategy that compresses the full-precision encoder to 1.58-bit weights (a ternary value carries log2(3) ≈ 1.58 bits). During this process, a full-precision encoder serves as a teacher model to better align latent representations. Despite the lack of large-scale robotics pretraining, BitVLA achieves performance comparable to the state-of-the-art model OpenVLA-OFT with 4-bit post-training quantization on the LIBERO benchmark, while consuming only 29.8% of the memory. These results highlight BitVLA's promise for deployment on memory-constrained edge devices. We release the code and model weights at https://github.com/ustcwhy/BitVLA.
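To make the ternary constraint concrete, the following is a minimal sketch of how a group of full-precision weights could be mapped to {-1, 0, 1} with one shared scale. The abstract does not specify the quantizer, so the absmean rounding rule used here (divide by the mean absolute weight, round, clip) is an assumption, and the names `ternarize`, `dequantize`, and `BITS_PER_TERNARY_WEIGHT` are hypothetical and not taken from the BitVLA code.

```python
import math


def ternarize(weights, eps=1e-8):
    """Map floats to {-1, 0, 1} plus one shared scale.

    Assumption: absmean scaling (scale = mean |w|), then round and clip.
    The source abstract does not state which rule BitVLA uses.
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale


def dequantize(ternary, scale):
    """Approximate the original weights from ternary codes and the scale."""
    return [t * scale for t in ternary]


# Information content of one three-valued weight, hence "1.58-bit".
BITS_PER_TERNARY_WEIGHT = math.log2(3)

weights = [0.9, -0.05, -1.2, 0.3, 0.0, -0.7]
ternary, scale = ternarize(weights)
approx = dequantize(ternary, scale)
```

For the sample weights, every entry of `ternary` lies in {-1, 0, 1}, and `approx` keeps each weight's sign while collapsing its magnitude to either zero or the shared scale. Storing a single floating-point scale per group alongside the ternary codes is the source of the memory savings in schemes of this kind.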
