StableVLA: Towards Robust Vision-Language-Action Models without Extra Data
May 18, 2026
Authors: Yiyang Fu, Chubin Zhang, Shukai Gong, Yufan Deng, Kaiwei Sun, Qiyang Min, Qibin Hou, Yansong Tang, Jianan Wang, Daquan Zhou
cs.AI
Abstract
It is infeasible to encompass all possible disturbances within the training dataset. This raises a critical question regarding the robustness of Vision-Language-Action (VLA) models when encountering unseen real-world visual disturbances, particularly under imperfect visual conditions. In this work, we conduct a systematic study based on recent state-of-the-art VLA models and reveal a significant performance drop when visual disturbances absent from the training data are introduced. To mitigate this issue, we propose a lightweight adapter module grounded in information theory, termed the Information Bottleneck Adapter (IB-Adapter), which selectively filters potential noise from visual inputs. Without requiring any extra data or augmentation strategies, IB-Adapter consistently improves over the baseline by an average of 30%, while adding fewer than 10M parameters, demonstrating notable efficiency and effectiveness. Furthermore, even with a 14x smaller backbone (0.5B parameters) and no pre-training on the Open X-Embodiment dataset, our model StableVLA achieves robustness competitive with 7B-scale state-of-the-art VLAs. With negligible parameter overhead (<10M), our approach maintains accuracy on long-horizon tasks and surpasses OpenPi under both synthetic and physical visual corruptions.
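The abstract does not describe the internals of IB-Adapter, so the following is only a minimal sketch of a generic variational information-bottleneck adapter, included to illustrate what "filtering noise through a bottleneck" can look like in practice. Everything in it is an assumption rather than the authors' design: the class name `VIBAdapter`, the Gaussian bottleneck, the zero-initialized up-projection, and the residual connection are all hypothetical. The KL term is the standard variational upper bound on the information the bottleneck retains about its input; in training it would typically be added to the task loss with a small weight.

```python
import numpy as np


class VIBAdapter:
    """Hypothetical variational information-bottleneck adapter.

    Illustrative only; this is NOT the paper's IB-Adapter.
    """

    def __init__(self, dim, bottleneck, seed=0):
        self.rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(dim)
        # Down-projections to the mean and log-variance of the latent code.
        self.w_mu = self.rng.normal(0.0, scale, (dim, bottleneck))
        self.w_logvar = self.rng.normal(0.0, scale, (dim, bottleneck))
        # Zero-initialized up-projection: the adapter starts as an identity map.
        self.w_up = np.zeros((bottleneck, dim))

    def num_params(self):
        return self.w_mu.size + self.w_logvar.size + self.w_up.size

    def forward(self, x, sample=True):
        """x: (tokens, dim) visual features. Returns (features, per-token KL)."""
        mu = x @ self.w_mu
        logvar = x @ self.w_logvar
        if sample:
            # Reparameterization: z = mu + sigma * eps, with eps ~ N(0, I).
            eps = self.rng.standard_normal(mu.shape)
            z = mu + np.exp(0.5 * logvar) * eps
        else:
            z = mu
        # KL(N(mu, sigma^2) || N(0, I)), summed over bottleneck dimensions.
        kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)
        # Residual connection around the bottleneck path.
        return x + z @ self.w_up, kl


if __name__ == "__main__":
    features = np.random.default_rng(1).normal(size=(4, 32))
    adapter = VIBAdapter(dim=32, bottleneck=8)
    out, kl = adapter.forward(features)
    # Hypothetical training objective: task_loss + beta * kl.mean()
    print(out.shape, kl.shape, adapter.num_params())
```

In this sketch the bottleneck width controls both the parameter count and how much of the input can pass through, which is one plausible way an adapter of this kind could stay small relative to the backbone.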