Hidden in plain sight: VLMs overlook their visual representations

Abstract

Language provides a natural interface to specify and evaluate performance on visual tasks. To realize this possibility, vision language models (VLMs) must successfully integrate visual and linguistic information. Our work compares VLMs to a direct readout of their visual encoders to understand their ability to integrate across these modalities. Across a series of vision-centric benchmarks (e.g., depth estimation, correspondence), we find that VLMs perform substantially worse than their visual encoders, dropping to near-chance performance. We investigate these results through a series of analyses across the entire VLM: namely 1) the degradation of vision representations, 2) brittleness to task prompt, and 3) the language model's role in solving the task. We find that the bottleneck in performing these vision-centric tasks lies in this third category; VLMs are not effectively using visual information easily accessible throughout the entire model, and they inherit the language priors present in the LLM. Our work helps diagnose the failure modes of open-source VLMs, and presents a series of evaluations useful for future investigations into visual understanding within VLMs.
