Hidden in plain sight: VLMs overlook their visual representations
June 9, 2025
Authors: Stephanie Fu, Tyler Bonnen, Devin Guillory, Trevor Darrell
cs.AI
Abstract
Language provides a natural interface to specify and evaluate performance on
visual tasks. To realize this possibility, vision language models (VLMs) must
successfully integrate visual and linguistic information. Our work compares
VLMs to a direct readout of their visual encoders to understand their ability
to integrate across these modalities. Across a series of vision-centric
benchmarks (e.g., depth estimation, correspondence), we find that VLMs perform
substantially worse than their visual encoders, dropping to near-chance
performance. We investigate these results through a series of analyses across
the entire VLM: namely 1) the degradation of vision representations, 2)
brittleness to task prompts, and 3) the language model's role in solving the
task. We find that the bottleneck in performing these vision-centric tasks lies
in this third category; VLMs are not effectively using visual information
easily accessible throughout the entire model, and they inherit the language
priors present in the LLM. Our work helps diagnose the failure modes of
open-source VLMs, and presents a series of evaluations useful for future
investigations into visual understanding within VLMs.