Look Before Acting: Enhancing Vision Foundation Representations for Vision-Language-Action Models
March 16, 2026
Authors: Yulin Luo, Hao Chen, Zhuangzhe Wu, Bowen Sui, Jiaming Liu, Chenyang Gu, Zhuoyang Liu, Qiuxuan Feng, Jiale Yu, Shuo Gu, Peng Jia, Pheng-Ann Heng, Shanghang Zhang
cs.AI
Abstract
Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for robotic manipulation, in which reliable action prediction critically depends on accurately interpreting and integrating visual observations conditioned on language instructions. Although recent works have sought to enhance the visual capabilities of VLA models, most approaches treat the LLM backbone as a black box, providing limited insight into how visual information is grounded into action generation. Therefore, we perform a systematic analysis of multiple VLA models across different action-generation paradigms and observe that sensitivity to visual tokens progressively decreases in deeper layers during action generation. Motivated by this observation, we propose DeepVision-VLA, built on a Vision-Language Mixture-of-Transformers (VL-MoT) framework. This framework enables shared attention between the vision foundation model and the VLA backbone, injecting multi-level visual features from the vision expert into deeper layers of the VLA backbone to enhance visual representations for precise and complex manipulation. In addition, we introduce Action-Guided Visual Pruning (AGVP), which leverages shallow-layer attention to prune irrelevant visual tokens while preserving task-relevant ones, reinforcing critical visual cues for manipulation with minimal computational overhead. DeepVision-VLA outperforms prior state-of-the-art methods by 9.0% and 7.5% on simulated and real-world tasks, respectively, providing new insights for the design of visually enhanced VLA models.
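The abstract describes AGVP only at a high level: a shallow layer's attention map is used to keep the visual tokens most relevant to the task and discard the rest before the deeper, more expensive layers. The sketch below illustrates this general attention-guided pruning idea under stated assumptions; the function name, tensor shapes, and keep ratio are hypothetical and it is not the authors' implementation.

```python
# Minimal sketch of attention-guided visual token pruning (assumptions, not the
# DeepVision-VLA code): score each visual token by the shallow-layer attention it
# receives from the tokens that drive action prediction, then keep the top-k.
import torch


def prune_visual_tokens(hidden_states: torch.Tensor,
                        attn_weights: torch.Tensor,
                        visual_idx: torch.Tensor,
                        query_idx: torch.Tensor,
                        keep_ratio: float = 0.5) -> torch.Tensor:
    """hidden_states: [B, T, D]  token embeddings after a shallow layer
    attn_weights:  [B, H, T, T]  attention map from that same shallow layer
    visual_idx:    [Nv]  positions of visual tokens in the sequence
    query_idx:     [Nq]  positions of instruction / action-query tokens
    """
    # Attention mass flowing from the query tokens to each visual token,
    # averaged over heads and query positions -> one score per visual token.
    scores = attn_weights[:, :, query_idx][:, :, :, visual_idx].mean(dim=(1, 2))  # [B, Nv]

    # Keep the top-k visual tokens per sample.
    k = max(1, int(keep_ratio * visual_idx.numel()))
    kept_visual = visual_idx[scores.topk(k, dim=-1).indices]  # [B, k]

    # Rebuild a shorter sequence: all non-visual tokens plus the kept visual ones,
    # restored to their original order.
    non_visual = torch.tensor(
        [t for t in range(hidden_states.size(1)) if t not in set(visual_idx.tolist())],
        device=hidden_states.device,
    )
    pruned = []
    for b in range(hidden_states.size(0)):
        keep = torch.cat([non_visual, kept_visual[b]]).sort().values
        pruned.append(hidden_states[b, keep])
    return torch.stack(pruned)  # [B, T - Nv + k, D]
```

With a keep ratio of 0.5, half of the visual tokens are dropped before the deeper layers, which is where the abstract claims the computational savings come from while task-relevant visual cues are preserved.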