

Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization

October 29, 2025
Authors: Nikita Kachaev, Mikhail Kolosov, Daniil Zelezetsky, Alexey K. Kovalev, Aleksandr I. Panov
cs.AI

Abstract

The growing success of Vision-Language-Action (VLA) models stems from the promise that pretrained Vision-Language Models (VLMs) can endow agents with transferable world knowledge and vision-language (VL) grounding, laying a foundation for action models with broader generalization. Yet when these VLMs are adapted to the action modality, it remains unclear to what extent their original VL representations and knowledge are preserved. In this work, we conduct a systematic study of representation retention during VLA fine-tuning, showing that naive action fine-tuning leads to degradation of visual representations. To characterize and measure these effects, we probe VLAs' hidden representations and analyze attention maps; further, we design a set of targeted tasks and methods that contrast VLA models with their counterpart VLMs, isolating changes in VL capabilities induced by action fine-tuning. We further evaluate a range of strategies for aligning visual representations and introduce a simple yet effective method that mitigates degradation and yields improved generalization to out-of-distribution (OOD) scenarios. Taken together, our analysis clarifies the trade-off between action fine-tuning and the degradation of VL representations and highlights practical approaches to recover inherited VL capabilities. Code is publicly available: https://blind-vla-paper.github.io
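
The abstract does not specify how the visual-representation alignment is implemented; the following is a minimal, hypothetical PyTorch sketch of one common family of such regularizers: adding an auxiliary loss that keeps the fine-tuned vision features close to those of a frozen copy of the pretrained VLM's vision tower during action fine-tuning. The VisionEncoder class, the 7-DoF action head, the cosine-based alignment_loss, and the weight lambda_align are illustrative assumptions, not the paper's actual method.

# Hypothetical sketch of visual-representation alignment during action fine-tuning.
# Assumes a frozen copy of the pretrained vision tower serves as the alignment target.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisionEncoder(nn.Module):
    """Toy stand-in for a VLM vision tower producing patch-token features."""
    def __init__(self, dim=256, patch_pixels=3 * 16 * 16):
        super().__init__()
        self.proj = nn.Linear(patch_pixels, dim)

    def forward(self, patches):              # patches: (B, num_patches, patch_pixels)
        return self.proj(patches)            # (B, num_patches, dim)

def alignment_loss(tuned_feats, frozen_feats):
    """Penalize drift of fine-tuned visual tokens from the frozen VLM's tokens
    (here: 1 - cosine similarity, averaged over all tokens)."""
    return (1.0 - F.cosine_similarity(tuned_feats, frozen_feats, dim=-1)).mean()

# Frozen pretrained encoder (alignment target) and its trainable copy inside the VLA.
frozen_enc = VisionEncoder()
tuned_enc = VisionEncoder()
tuned_enc.load_state_dict(frozen_enc.state_dict())   # start from pretrained weights
for p in frozen_enc.parameters():
    p.requires_grad_(False)

action_head = nn.Linear(256, 7)                       # e.g. a 7-DoF action prediction head
opt = torch.optim.AdamW(
    list(tuned_enc.parameters()) + list(action_head.parameters()), lr=1e-4
)

patches = torch.randn(8, 16, 3 * 16 * 16)             # dummy image patches
target_actions = torch.randn(8, 7)                    # dummy action labels
lambda_align = 0.1                                    # weight of the alignment term

tuned_feats = tuned_enc(patches)
with torch.no_grad():
    frozen_feats = frozen_enc(patches)

# Total loss = action-imitation loss + alignment regularizer.
action_loss = F.mse_loss(action_head(tuned_feats.mean(dim=1)), target_actions)
loss = action_loss + lambda_align * alignment_loss(tuned_feats, frozen_feats)
loss.backward()
opt.step()

In this sketch, setting lambda_align to zero recovers naive action fine-tuning, while larger values trade action-loss fit for retention of the pretrained visual features; the actual objective and weighting used in the paper may differ.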