VLMs Need Words: Vision Language Models Ignore Visual Detail In Favor of Semantic Anchors
April 2, 2026
Authors: Haz Sameen Shahgir, Xiaofu Chen, Yu Fu, Erfan Shayegani, Nael Abu-Ghazaleh, Yova Kementchedjhieva, Yue Dong
cs.AI
Abstract
Vision Language Models (VLMs) achieve impressive performance across a wide range of multimodal tasks. However, on tasks that demand fine-grained visual perception, they often fail even when the required information is present in their internal representations. In this work, we demonstrate that this gap arises from their narrow training pipeline, which focuses on moving visual information into the textual space. Consequently, VLMs can only reason about visual entities that can be mapped to known concepts in the language space, leaving vision-focused tasks such as visual correspondence and reasoning about novel visual entities poorly supported. As a result, VLMs are severely limited in several important multimodal capabilities, because they rely on brittle, hallucinated textual descriptions of the visual entities they cannot map to textual representations. We verify this behavior through visual correspondence tasks, in which VLMs must detect matching entities between two images. Testing across semantic, shape, and face correspondence tasks, we find that VLMs perform much better when the relevant entities are nameable in language than when they are unnameable. Mechanistically, our Logit Lens analyses confirm that VLMs explicitly assign semantic labels to nameable entities and surface more unique corresponding tokens for nameable than for unnameable entities. Furthermore, we show that teaching the model completely arbitrary names for unknown entities improves performance, while task-specific finetuning yields even stronger generalization without relying on language priors. Our findings suggest that current VLM failures on visual tasks reflect shortcuts learned during training, rather than a fundamental limitation of multimodal architectures.
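For readers unfamiliar with the Logit Lens method cited above, here is a minimal sketch of the general technique: decode each intermediate layer's hidden state through the model's unembedding matrix to see which tokens that layer already encodes. This is an illustrative stand-in, not the authors' code; it uses a plain text-only language model (gpt2) and a text prompt in place of the paper's VLM-and-image setup, and all names in it are assumptions.

```python
# Minimal Logit Lens sketch (illustrative, not the paper's implementation).
# Idea: project every layer's hidden state through the unembedding matrix
# and inspect the top tokens, revealing what the model "labels" internally.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder stand-in; the paper analyzes VLMs instead
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

inputs = tok("A photo of a red bicycle", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

unembed = model.get_output_embeddings().weight  # (vocab_size, hidden_size)
ln_f = model.transformer.ln_f  # GPT-2's final layer norm; logit-lens style
last_pos = inputs["input_ids"].shape[1] - 1

for layer, h in enumerate(out.hidden_states):
    # Standard logit-lens refinement: apply the final layer norm before
    # unembedding (note the last hidden state already includes it, so the
    # final layer is slightly re-normalized here).
    logits = ln_f(h[0, last_pos]) @ unembed.T
    top = logits.topk(3).indices
    print(layer, tok.convert_ids_to_tokens(top.tolist()))
```

Applied to a VLM, the same readout would be run over the hidden states at image-token positions; the paper's claim is that nameable entities surface distinctive semantic tokens this way, while unnameable ones do not.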