Hidden in plain sight: VLMs overlook their visual representations
June 9, 2025
Authors: Stephanie Fu, Tyler Bonnen, Devin Guillory, Trevor Darrell
cs.AI
Abstract
Language provides a natural interface to specify and evaluate performance on
visual tasks. To realize this possibility, vision language models (VLMs) must
successfully integrate visual and linguistic information. Our work compares
VLMs to a direct readout of their visual encoders to understand their ability
to integrate across these modalities. Across a series of vision-centric
benchmarks (e.g., depth estimation, correspondence), we find that VLMs perform
substantially worse than their visual encoders, dropping to near-chance
performance. We investigate these results through a series of analyses across
the entire VLM: namely, 1) the degradation of vision representations, 2)
brittleness to the task prompt, and 3) the language model's role in solving the
task. We find that the bottleneck in performing these vision-centric tasks lies
in this third category: VLMs fail to make effective use of visual information
that remains easily accessible throughout the model, and they inherit the language
priors present in the LLM. Our work helps diagnose the failure modes of
open-source VLMs, and presents a series of evaluations useful for future
investigations into visual understanding within VLMs.
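To make the comparison concrete, here is a minimal sketch of the kind of paired evaluation the abstract describes: the same vision-centric task (binary depth ordering, as one example) is posed (a) to the full VLM through a natural-language prompt and (b) to a "direct readout" of the frozen vision encoder, here instantiated as a linear probe over patch features. The `vlm.generate`, `encoder.patch_index`, `encoder.feature_dim` interfaces and the dataset format are hypothetical placeholders, not the authors' released code; the probe design is one common choice of direct readout.

```python
# Sketch of the paired evaluation described in the abstract, assuming
# hypothetical model wrappers. Each dataset item is (image, (pt_a, pt_b), label),
# where label is 0 if point A is closer to the camera, else 1.
import torch
import torch.nn as nn

@torch.no_grad()
def vlm_accuracy(vlm, dataset, prompt_template):
    """Pose the depth-ordering question to the VLM in natural language."""
    correct = 0
    for image, (pt_a, pt_b), label in dataset:
        prompt = prompt_template.format(a=pt_a, b=pt_b)
        answer = vlm.generate(image, prompt)            # hypothetical VLM API
        pred = 0 if answer.strip().upper().startswith("A") else 1
        correct += int(pred == label)
    return correct / len(dataset)

@torch.no_grad()
def patch_features(encoder, image, point):
    """Frozen encoder features at the patch containing `point` (x, y)."""
    feats = encoder(image)                    # (num_patches, dim); hypothetical
    return feats[encoder.patch_index(point)]  # hypothetical patch-lookup helper

def probe_accuracy(encoder, train_set, test_set, epochs=10, lr=1e-3):
    """Direct readout: linear probe on paired frozen patch features."""
    head = nn.Linear(2 * encoder.feature_dim, 2)  # hypothetical dim attribute
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    for _ in range(epochs):
        for image, (pt_a, pt_b), label in train_set:
            x = torch.cat([patch_features(encoder, image, pt_a),
                           patch_features(encoder, image, pt_b)])
            loss = nn.functional.cross_entropy(head(x[None]),
                                               torch.tensor([label]))
            opt.zero_grad()
            loss.backward()
            opt.step()
    with torch.no_grad():
        hits = sum(int(head(torch.cat([patch_features(encoder, img, a),
                                       patch_features(encoder, img, b)])[None])
                       .argmax() == y)
                   for img, (a, b), y in test_set)
    return hits / len(test_set)
```

Under this kind of setup, the abstract's headline finding is that (a) lands near chance while (b) does not, which localizes the failure to cross-modal integration rather than to the visual features themselves.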