Drop-Then-Recovery: How Redundant Are Vision-Language-Action Models?

Abstract

Vision-Language-Action (VLA) models enable instruction-driven robotic manipulation, but they inherit oversized language backbones from pretrained VLMs whose capacity far exceeds what short robotic instructions require. This raises a basic question: how much of a VLA model is actually necessary for closed-loop control? In this work, we study architectural redundancy in VLA models by using transformer block removal as a controlled intervention. We introduce Drop-Then-Recovery (DTR), an analysis protocol that removes selected blocks from a pretrained VLA model and then fine-tunes the resulting model to measure whether the removed capacity was necessary for downstream control. To make this intervention reliable, we propose GateProbe, a one-shot virtual-gate sensitivity metric that ranks blocks by their contribution to the downstream action loss. Across multiple VLA architectures, manipulation benchmarks, and even real-robot industrial scenarios, we find a strong asymmetry in post-removal recoverability: language backbones are highly redundant for standard robotic manipulation tasks, whereas vision and action pathways are substantially less tolerant to removal. On LIBERO, removing half of the LLM blocks improves OpenVLA-OFT from 95.0% to 98.3% under the same downstream fine-tuning budget, and retaining only two language blocks still recovers baseline-level performance. These results suggest that current VLA benchmarks may exert limited pressure on deep language grounding and compositional instruction understanding, and that future VLA architectures should allocate capacity more deliberately across language, vision, and action components. The code is available at https://github.com/s1ghhh/VLADrop.
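
The abstract does not give the GateProbe formula, so the following is only an illustrative sketch of one plausible reading, not the authors' implementation. It assumes that each residual block is scaled by a virtual gate g_i fixed at 1, and that a block's score is the magnitude of the loss gradient with respect to that gate, obtained from a single forward and backward pass. The toy scalar residual stack, the squared-error loss, and the names `forward`, `gate_sensitivity`, and `drop_blocks` are all hypothetical stand-ins for a real VLA model and its action loss.

```python
import math

def forward(weights, x, gates=None):
    """Toy scalar residual stack: h <- h + g * w * tanh(h). Returns all activations."""
    if gates is None:
        gates = [1.0] * len(weights)  # virtual gates: identity at g = 1
    hs = [x]
    for w, g in zip(weights, gates):
        hs.append(hs[-1] + g * w * math.tanh(hs[-1]))
    return hs

def gate_sensitivity(weights, data):
    """Assumed score: |dL/dg_i| at g_i = 1, with L = mean 0.5 * (output - target)^2."""
    grads = [0.0] * len(weights)
    for x, target in data:
        hs = forward(weights, x)
        delta = hs[-1] - target  # dL/dh at the output for this sample
        for i in reversed(range(len(weights))):
            t = math.tanh(hs[i])
            # Gradient w.r.t. the gate on block i (the gate multiplies w_i * tanh(h_i)).
            grads[i] += delta * weights[i] * t / len(data)
            # Backpropagate through block i to obtain dL/dh at its input.
            delta *= 1.0 + weights[i] * (1.0 - t * t)
    return [abs(g) for g in grads]

def drop_blocks(weights, scores, n_drop):
    """Drop step: remove the n_drop lowest-scoring blocks, keeping the original order."""
    ranked = sorted(range(len(weights)), key=lambda i: scores[i])
    keep = sorted(ranked[n_drop:])
    return keep, [weights[i] for i in keep]

# Hypothetical toy setup: blocks 1 and 3 contribute almost nothing to the output.
weights = [0.8, 0.001, -0.5, 0.002]
data = [(0.5, 1.0), (1.0, 2.0), (-0.7, -1.2)]  # (input, target) pairs

scores = gate_sensitivity(weights, data)
keep, pruned = drop_blocks(weights, scores, n_drop=2)
print("scores:", scores)
print("kept block indices:", keep)
```

In this sketch the two near-zero blocks receive the lowest scores and are the ones removed. The recovery half of the protocol, fine-tuning the pruned model on the downstream task and comparing against the unpruned baseline under the same budget, is omitted here.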