VLMs Need Words: Vision Language Models Ignore Visual Detail In Favor of Semantic Anchors
April 2, 2026
Authors: Haz Sameen Shahgir, Xiaofu Chen, Yu Fu, Erfan Shayegani, Nael Abu-Ghazaleh, Yova Kementchedjhieva, Yue Dong
cs.AI
Abstract
Vision Language Models (VLMs) achieve impressive performance across a wide range of multimodal tasks. However, on tasks that demand fine-grained visual perception, they often fail even when the required information is present in their internal representations. In this work, we demonstrate that this gap arises from their narrow training pipeline, which focuses on moving visual information into the textual space. Consequently, VLMs can only reason about visual entities that map to known concepts in the language space, leaving vision-centric tasks such as visual correspondence and reasoning about novel visual entities poorly supported. As a result, VLMs are severely limited in several important multimodal capabilities: they fall back on brittle, hallucinated textual descriptions of visual entities that they cannot map to textual representations. We verify this behavior through visual correspondence tasks, in which VLMs must detect matching entities between two images. Testing on semantic, shape, and face correspondence tasks, we find that VLMs perform much better when the relevant entities are nameable in language than when they are not. Mechanistically, our Logit Lens analyses confirm that VLMs explicitly assign semantic labels to nameable entities and surface more unique corresponding tokens for them than for unnameable entities. Furthermore, we show that teaching models completely arbitrary names for unknown entities improves performance, while task-specific finetuning yields even stronger generalization without relying on language priors. Our findings suggest that current VLM failures on visual tasks reflect shortcuts learned during training, rather than a fundamental limitation of multimodal architectures.
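For readers unfamiliar with the Logit Lens technique referenced above, the sketch below shows the standard recipe: project each layer's intermediate hidden state through the model's final LayerNorm and unembedding matrix to read off which tokens that layer is surfacing. This is a minimal, illustrative sketch only; it uses a GPT-2-style text decoder as a stand-in (the model name, prompt, and the GPT-2-specific attribute model.transformer.ln_f are assumptions, and the paper's actual probing is over VLM decoders at image-token positions, which is not reproduced here).

    # Minimal Logit Lens sketch (illustrative; not the paper's exact setup).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")            # stand-in model
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    inputs = tok("A photo of a zebra standing in the", return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)

    ln_f = model.transformer.ln_f            # final LayerNorm (GPT-2 naming)
    unembed = model.get_output_embeddings()  # lm_head: hidden dim -> vocab

    # Project each layer's hidden state at the last position into vocab
    # space. Intermediate states are pre-ln_f, so apply ln_f before
    # unembedding, as in the usual logit-lens recipe.
    for layer, h in enumerate(out.hidden_states):
        logits = unembed(ln_f(h[0, -1]))
        top = logits.topk(5).indices
        print(f"layer {layer:2d}: {[tok.decode(int(t)) for t in top]}")

Under the paper's framing, running such a probe at the positions of a nameable entity would surface its semantic label (e.g., a species name) in intermediate layers, whereas an unnameable entity would yield fewer distinct, stable tokens.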