StableVLA: Towards Robust Vision-Language-Action Models without Extra Data

Abstract

It is infeasible to cover every possible disturbance within a training dataset. This raises a critical question about how robust Vision-Language-Action (VLA) models are when they encounter unseen real-world visual disturbances, particularly under imperfect visual conditions. In this work, we conduct a systematic study of recent state-of-the-art VLA models and find a significant performance drop when visual disturbances absent from the training data are introduced. To mitigate this issue, we propose the Information Bottleneck Adapter (IB-Adapter), a lightweight adapter module grounded in information theory that selectively filters potential noise from visual inputs. Without any extra data or augmentation strategies, IB-Adapter consistently improves over the baseline by an average of 30% while adding fewer than 10M parameters. Furthermore, even with a 14x smaller backbone (0.5B parameters) and no pre-training on the Open X-Embodiment dataset, our model, StableVLA, achieves robustness competitive with 7B-scale state-of-the-art VLAs. It also maintains accuracy on long-horizon tasks and surpasses OpenPi under both synthetic and physical visual corruptions.
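
For readers unfamiliar with the principle the adapter's name refers to, the generic information bottleneck objective can be written as below. This is the textbook formulation, not necessarily the loss used by IB-Adapter, which the abstract does not specify. The symbols are assumptions introduced here for illustration: X stands for the visual features, Y for the target action, Z for the filtered representation, and \beta for a trade-off weight.

```latex
% Generic information bottleneck objective (textbook form; an assumption,
% not the paper's stated loss). X: visual features, Y: target action,
% Z: compressed representation, \beta > 0: trade-off weight.
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y)

% A commonly used variational upper bound on the compression term,
% with r(z) an arbitrary prior over Z:
I(X; Z) \;\le\; \mathbb{E}_{x}\!\left[ \mathrm{KL}\!\left( p(z \mid x) \,\|\, r(z) \right) \right]
```

Under this reading, minimizing I(X; Z) discards input detail, including unseen visual noise, while the I(Z; Y) term keeps the information needed to predict the action.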