Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization

October 29, 2025
Authors: Nikita Kachaev, Mikhail Kolosov, Daniil Zelezetsky, Alexey K. Kovalev, Aleksandr I. Panov
cs.AI

Abstract

The growing success of Vision-Language-Action (VLA) models stems from the promise that pretrained Vision-Language Models (VLMs) can endow agents with transferable world knowledge and vision-language (VL) grounding, laying a foundation for action models with broader generalization. Yet when these VLMs are adapted to the action modality, it remains unclear to what extent their original VL representations and knowledge are preserved. In this work, we conduct a systematic study of representation retention during VLA fine-tuning, showing that naive action fine-tuning leads to degradation of visual representations. To characterize and measure these effects, we probe the VLA's hidden representations and analyze attention maps; further, we design a set of targeted tasks and methods that contrast VLA models with their counterpart VLMs, isolating changes in VL capabilities induced by action fine-tuning. We further evaluate a range of strategies for aligning visual representations and introduce a simple yet effective method that mitigates degradation and yields improved generalization to out-of-distribution (OOD) scenarios. Taken together, our analysis clarifies the trade-off between action fine-tuning and the degradation of VL representations and highlights practical approaches to recover inherited VL capabilities. Code is publicly available: https://blind-vla-paper.github.io
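
The abstract does not spell out how the visual representations are aligned during action fine-tuning. As a rough sketch only, one common way to realize such an objective is an auxiliary feature-distillation term that keeps the VLA's visual features close to those of a frozen copy of the pretrained VLM. The interfaces below (`vla_model`, `frozen_vlm.encode_image`, `return_visual_features`, `lambda_align`) are hypothetical placeholders, not the paper's actual API; see https://blind-vla-paper.github.io for the authors' method.

```python
import torch
import torch.nn.functional as F

def alignment_loss(vla_visual_feats: torch.Tensor,
                   vlm_visual_feats: torch.Tensor) -> torch.Tensor:
    """Cosine-distance between the fine-tuned VLA's visual features and
    those of the frozen pretrained VLM (treated as a fixed target)."""
    target = vlm_visual_feats.detach()  # stop-gradient: the VLM stays frozen
    return 1.0 - F.cosine_similarity(vla_visual_feats, target, dim=-1).mean()

def training_step(batch, vla_model, frozen_vlm, action_loss_fn, lambda_align=0.1):
    """One fine-tuning step: action objective plus the alignment regularizer."""
    # Hypothetical interface: the VLA returns action predictions and its
    # intermediate visual features in a single forward pass.
    action_pred, vla_feats = vla_model(batch["images"], batch["instruction"],
                                       return_visual_features=True)
    with torch.no_grad():
        vlm_feats = frozen_vlm.encode_image(batch["images"])
    loss = action_loss_fn(action_pred, batch["actions"])
    return loss + lambda_align * alignment_loss(vla_feats, vlm_feats)
```

In a setup like this, `lambda_align` trades off fitting the action data against retaining the inherited VL representations; the paper evaluates several such alignment strategies rather than prescribing this particular one.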