

Look Before Acting: Enhancing Vision Foundation Representations for Vision-Language-Action Models

March 16, 2026
作者: Yulin Luo, Hao Chen, Zhuangzhe Wu, Bowen Sui, Jiaming Liu, Chenyang Gu, Zhuoyang Liu, Qiuxuan Feng, Jiale Yu, Shuo Gu, Peng Jia, Pheng-Ann Heng, Shanghang Zhang
cs.AI

Abstract

Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for robotic manipulation, in which reliable action prediction critically depends on accurately interpreting and integrating visual observations conditioned on language instructions. Although recent works have sought to enhance the visual capabilities of VLA models, most approaches treat the LLM backbone as a black box, providing limited insight into how visual information is grounded into action generation. Therefore, we perform a systematic analysis of multiple VLA models across different action-generation paradigms and observe that sensitivity to visual tokens progressively decreases in deeper layers during action generation. Motivated by this observation, we propose DeepVision-VLA, built on a Vision-Language Mixture-of-Transformers (VL-MoT) framework. This framework enables shared attention between the vision foundation model and the VLA backbone, injecting multi-level visual features from the vision expert into deeper layers of the VLA backbone to enhance visual representations for precise and complex manipulation. In addition, we introduce Action-Guided Visual Pruning (AGVP), which leverages shallow-layer attention to prune irrelevant visual tokens while preserving task-relevant ones, reinforcing critical visual cues for manipulation with minimal computational overhead. DeepVision-VLA outperforms prior state-of-the-art methods by 9.0% and 7.5% on simulated and real-world tasks, respectively, providing new insights for the design of visually enhanced VLA models.
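To make the shared-attention idea concrete, the sketch below shows one way depth-matched features from a frozen vision expert could be attended to by a deeper VLA backbone layer, in the spirit of the VL-MoT design described in the abstract. This is an illustration under our own assumptions, not the authors' implementation; the module name `VisionExpertInjection`, the dimensions, and the single-layer setup are all hypothetical.

```python
# Minimal sketch (not the paper's code) of injecting vision-expert features
# into a deeper VLA layer via shared attention over both token streams.
import torch
import torch.nn as nn


class VisionExpertInjection(nn.Module):
    """One decoder layer whose hidden states attend jointly over themselves
    and depth-matched features from a vision expert (hypothetical module)."""

    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.proj_vision = nn.Linear(d_model, d_model)  # map expert features into the backbone space
        self.norm = nn.LayerNorm(d_model)

    def forward(self, vla_tokens: torch.Tensor, vision_feats: torch.Tensor) -> torch.Tensor:
        # vla_tokens:   (B, N_t, d_model) backbone hidden states at this depth
        # vision_feats: (B, N_v, d_model) features from the vision expert at a matched level
        kv = torch.cat([vla_tokens, self.proj_vision(vision_feats)], dim=1)
        out, _ = self.attn(query=vla_tokens, key=kv, value=kv)
        return self.norm(vla_tokens + out)  # residual keeps the original pathway intact


if __name__ == "__main__":
    layer = VisionExpertInjection()
    tokens = torch.randn(2, 64, 768)    # language/action tokens in the backbone
    feats = torch.randn(2, 196, 768)    # vision-expert features
    print(layer(tokens, feats).shape)   # torch.Size([2, 64, 768])
```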
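Similarly, the Action-Guided Visual Pruning idea of keeping only the visual tokens that receive the most attention from task tokens in shallow layers can be sketched as a simple top-k selection. Again, this is an assumed illustration rather than the paper's implementation; the function name, the keep ratio, and the tensor shapes are placeholders.

```python
# Minimal sketch (assumptions, not the paper's code) of attention-guided
# visual token pruning: rank visual tokens by shallow-layer attention
# received from task (language/action) queries and keep the top-k.
import torch


def prune_visual_tokens(visual_tokens: torch.Tensor,
                        attn_weights: torch.Tensor,
                        keep_ratio: float = 0.5) -> torch.Tensor:
    """
    visual_tokens: (B, N_v, D) visual token embeddings.
    attn_weights:  (B, H, N_q, N_v) shallow-layer attention from task queries
                   to visual tokens.
    Returns the top-k visual tokens ranked by received attention.
    """
    # Average attention over heads and query positions -> one score per visual token.
    scores = attn_weights.mean(dim=(1, 2))                 # (B, N_v)
    k = max(1, int(keep_ratio * visual_tokens.size(1)))
    top_idx = scores.topk(k, dim=-1).indices               # (B, k)
    top_idx = top_idx.sort(dim=-1).values                  # preserve original spatial order
    batch_idx = torch.arange(visual_tokens.size(0)).unsqueeze(-1)
    return visual_tokens[batch_idx, top_idx]               # (B, k, D)


if __name__ == "__main__":
    v = torch.randn(2, 196, 768)
    a = torch.rand(2, 12, 32, 196)
    print(prune_visual_tokens(v, a, keep_ratio=0.25).shape)  # torch.Size([2, 49, 768])
```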