StableVLA: Towards Robust Vision-Language-Action Models without Extra Data

Abstract

No training dataset can cover every possible disturbance. This raises a critical question about how robust Vision-Language-Action (VLA) models are when they encounter unseen real-world visual disturbances, particularly under imperfect visual conditions. In this work, we conduct a systematic study of recent state-of-the-art VLA models and show that their performance drops significantly when visual disturbances absent from the training data are introduced. To mitigate this issue, we propose a lightweight adapter module grounded in information theory, termed the Information Bottleneck Adapter (IB-Adapter), which selectively filters potential noise from visual inputs. Without requiring any extra data or augmentation strategies, IB-Adapter consistently improves over the baseline by an average of 30% while adding fewer than 10M parameters, demonstrating both efficiency and effectiveness. Furthermore, even with a 14x smaller backbone (0.5B parameters) and no pre-training on the Open X-Embodiment dataset, our model, StableVLA, achieves robustness competitive with 7B-scale state-of-the-art VLAs. With this negligible parameter overhead, our approach also maintains accuracy on long-horizon tasks and surpasses OpenPi under both synthetic and physical visual corruptions.
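
For orientation, the generic information bottleneck objective from the literature is sketched below. The abstract does not state the exact loss used by IB-Adapter, so this is only the standard formulation, not necessarily the one adopted here. The symbols are assumptions introduced for illustration: X denotes the visual input features, Y the target action, Z the compressed representation produced by the adapter, and \beta > 0 a trade-off weight.

\[
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y)
\]

Under this reading, the first term penalizes information that Z retains about the raw visual input (including nuisance factors such as unseen corruptions), while the second term rewards information that Z keeps about the action to be predicted.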