Look Before Acting: Enhancing Vision Foundation Representations for Vision-Language-Action Models

Abstract

Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for robotic manipulation, in which reliable action prediction depends critically on accurately interpreting and integrating visual observations conditioned on language instructions. Although recent work has sought to enhance the visual capabilities of VLA models, most approaches treat the LLM backbone as a black box and offer limited insight into how visual information is grounded in action generation. We therefore perform a systematic analysis of multiple VLA models across different action-generation paradigms and observe that sensitivity to visual tokens progressively decreases in deeper layers during action generation. Motivated by this observation, we propose DeepVision-VLA, built on a Vision-Language Mixture-of-Transformers (VL-MoT) framework. This framework shares attention between the vision foundation model and the VLA backbone, injecting multi-level visual features from the vision expert into the deeper layers of the VLA backbone to enhance visual representations for precise and complex manipulation. In addition, we introduce Action-Guided Visual Pruning (AGVP), which leverages shallow-layer attention to prune irrelevant visual tokens while preserving task-relevant ones, reinforcing critical visual cues for manipulation with minimal computational overhead. DeepVision-VLA outperforms prior state-of-the-art methods by 9.0% and 7.5% on simulated and real-world tasks, respectively, providing new insights for the design of visually enhanced VLA models.
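
The abstract does not say how AGVP scores or selects tokens, so the following is only a generic sketch of attention-based top-k token pruning, not the authors' implementation. It assumes a hypothetical attention matrix `attn[q][v]` taken from a shallow layer (weights from query tokens to visual tokens), scores each visual token by its mean attention, and keeps a fixed fraction. The function name `prune_visual_tokens` and the parameter `keep_ratio` are illustrative assumptions.

```python
from typing import List, Sequence


def prune_visual_tokens(
    attn: Sequence[Sequence[float]],
    keep_ratio: float = 0.5,
) -> List[int]:
    """Return indices of visual tokens to keep, in their original order.

    attn[q][v] is a hypothetical shallow-layer attention weight from query
    token q (e.g. an instruction or action token) to visual token v.
    This is an assumed scoring rule, not one taken from the paper.
    """
    if not attn or not attn[0]:
        return []
    num_visual = len(attn[0])
    # Score each visual token by its mean attention over all query tokens.
    scores = [sum(row[v] for row in attn) / len(attn) for v in range(num_visual)]
    # Keep at least one token so later layers always receive visual input.
    num_keep = max(1, round(keep_ratio * num_visual))
    ranked = sorted(range(num_visual), key=lambda v: scores[v], reverse=True)
    # Restore positional order among the surviving tokens.
    return sorted(ranked[:num_keep])


# Toy example: 2 query tokens attending over 4 visual tokens.
attention = [
    [0.10, 0.60, 0.05, 0.25],
    [0.20, 0.50, 0.10, 0.20],
]
kept = prune_visual_tokens(attention, keep_ratio=0.5)
print(kept)  # [1, 3]
```

In this toy case the mean scores are 0.15, 0.55, 0.075, and 0.225, so tokens 1 and 3 survive. Returning indices in positional order rather than score order is a design choice that leaves any downstream positional encoding intact.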
