VLMs Need Words: Vision Language Models Ignore Visual Detail In Favor of Semantic Anchors

Abstract

Vision Language Models (VLMs) achieve impressive performance across a wide range of multimodal tasks. However, on some tasks that demand fine-grained visual perception, they often fail even when the required information is present in their internal representations. In this work, we demonstrate that this gap arises from their narrow training pipeline, which focuses on moving visual information into the textual space. Consequently, VLMs can only reason about visual entities that can be mapped to known concepts in the language space, leaving vision-focused tasks such as visual correspondence and reasoning about novel visual entities poorly supported. For entities that cannot be mapped to textual representations, VLMs fall back on brittle, hallucinated textual descriptions, which severely limits several important multimodal capabilities. We verify this behavior through visual correspondence tasks, in which VLMs must detect matching entities between two images. Testing across semantic, shape, and face correspondence tasks, we find that VLMs perform much better when the relevant entities are nameable in language than when they are unnameable. Mechanistically, our Logit Lens analyses confirm that VLMs explicitly assign semantic labels to nameable entities and surface more unique corresponding tokens for them than for unnameable entities. Furthermore, we show that teaching completely arbitrary names for unknown entities improves performance, yet task-specific finetuning yields even stronger generalization without relying on language priors. Our findings suggest that current VLM failures on visual tasks reflect shortcuts learned during training rather than a fundamental limitation of multimodal architectures.
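
The Logit Lens analysis mentioned above can be illustrated with a toy sketch. The general idea is to project an intermediate hidden state through a final normalization step and the unembedding matrix, then read off which vocabulary tokens that state already favors. Everything below is a hypothetical assumption made for illustration: the three-token vocabulary, the weights, the layer names, and the helper names `layer_norm` and `logit_lens`. It is not the authors' code, and it does not reproduce any particular library's API.

```python
import math

def layer_norm(h, eps=1e-5):
    """Normalize a hidden vector to zero mean and unit variance (no learned scale or bias)."""
    mean = sum(h) / len(h)
    var = sum((x - mean) ** 2 for x in h) / len(h)
    return [(x - mean) / math.sqrt(var + eps) for x in h]

def logit_lens(hidden, unembed, vocab, top_k=2):
    """Project one intermediate hidden state onto the vocabulary; return the top-k tokens."""
    normed = layer_norm(hidden)
    # One logit per vocabulary row: dot product of the row with the normalized state.
    logits = [sum(w * x for w, x in zip(row, normed)) for row in unembed]
    ranked = sorted(range(len(vocab)), key=lambda i: logits[i], reverse=True)
    return [vocab[i] for i in ranked[:top_k]]

# Toy setup (all values hypothetical): 4-dimensional hidden states, 3-token vocabulary.
VOCAB = ["cat", "dog", "car"]
UNEMBED = [
    [1.0, 0.0, 0.0, -1.0],   # row for "cat"
    [0.0, 1.0, -1.0, 0.0],   # row for "dog"
    [-1.0, 0.0, 0.0, 1.0],   # row for "car"
]

# Hypothetical hidden states for a single image-patch position at two layers.
patch_states = {
    "layer_8": [2.0, 0.5, 0.0, -1.5],
    "layer_16": [-1.0, 0.2, 0.1, 3.0],
}

decoded = {name: logit_lens(h, UNEMBED, VOCAB) for name, h in patch_states.items()}
for name, tokens in decoded.items():
    print(name, tokens)
```

In a real VLM the hidden states would come from the language model's residual stream at image-token positions, and the projection would use the model's own final norm and output embedding. Under that reading, a nameable entity would tend to decode to a consistent semantic label, while an unnameable one would not.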
