Hidden in plain sight: VLMs overlook their visual representations

Abstract

Language provides a natural interface to specify and evaluate performance on visual tasks. To realize this possibility, vision language models (VLMs) must successfully integrate visual and linguistic information. Our work compares VLMs to a direct readout of their visual encoders to understand their ability to integrate across these modalities. Across a series of vision-centric benchmarks (e.g., depth estimation, correspondence), we find that VLMs perform substantially worse than their visual encoders, dropping to near-chance performance. We investigate these results through a series of analyses across the entire VLM: namely 1) the degradation of vision representations, 2) brittleness to task prompt, and 3) the language model's role in solving the task. We find that the bottleneck in performing these vision-centric tasks lies in this third category; VLMs are not effectively using visual information easily accessible throughout the entire model, and they inherit the language priors present in the LLM. Our work helps diagnose the failure modes of open-source VLMs, and presents a series of evaluations useful for future investigations into visual understanding within VLMs.
