Hidden in plain sight: VLMs overlook their visual representations

Abstract

Language provides a natural interface to specify and evaluate performance on visual tasks. To realize this possibility, vision language models (VLMs) must successfully integrate visual and linguistic information. Our work compares VLMs to a direct readout of their visual encoders to understand their ability to integrate across these modalities. Across a series of vision-centric benchmarks (e.g., depth estimation, correspondence), we find that VLMs perform substantially worse than their visual encoders, dropping to near-chance performance. We investigate these results through a series of analyses across the entire VLM: namely 1) the degradation of vision representations, 2) brittleness to the task prompt, and 3) the language model's role in solving the task. We find that the bottleneck in performing these vision-centric tasks lies in this third category; VLMs are not effectively using visual information that is easily accessible throughout the entire model, and they inherit the language priors present in the LLM. Our work helps diagnose the failure modes of open-source VLMs and presents a series of evaluations useful for future investigations into visual understanding within VLMs.
