StableVLA: Towards Robust Vision-Language-Action Models without Extra Data

Abstract

It is infeasible to cover every possible disturbance in a training dataset. This raises a critical question about how robust Vision-Language-Action (VLA) models are when they encounter unseen real-world visual disturbances, particularly under imperfect visual conditions. In this work, we conduct a systematic study of recent state-of-the-art VLA models and find a significant performance drop when visual disturbances absent from the training data are introduced. To mitigate this issue, we propose the Information Bottleneck Adapter (IB-Adapter), a lightweight adapter module grounded in information theory that selectively filters potential noise from visual inputs. Without requiring any extra data or augmentation strategies, IB-Adapter consistently improves over the baseline by 30% on average while adding fewer than 10M parameters. Furthermore, even with a 14x smaller backbone (0.5B parameters) and no pre-training on the Open X-Embodiment dataset, our model StableVLA achieves robustness competitive with 7B-scale state-of-the-art VLAs. Our approach also maintains accuracy on long-horizon tasks and surpasses OpenPi under both synthetic and physical visual corruptions.