VLM^2-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues

Abstract

Visually linking matching cues is a crucial ability in daily life, such as identifying the same person in multiple photos based on their cues, even without knowing who they are. Despite the extensive knowledge that vision-language models (VLMs) possess, it remains largely unexplored whether they are capable of performing this fundamental task. To address this, we introduce VLM^2-Bench, a benchmark designed to assess whether VLMs can Visually Link Matching cues, with 9 subtasks and over 3,000 test cases. Comprehensive evaluation across eight open-source VLMs and GPT-4o, along with further analysis of various language-side and vision-side prompting methods, leads to a total of eight key findings. We identify critical challenges in models' ability to link visual cues, highlighting a significant performance gap where even GPT-4o lags 34.80% behind humans. Based on these insights, we advocate for (i) enhancing core visual capabilities to improve adaptability and reduce reliance on prior knowledge, (ii) establishing clearer principles for integrating language-based reasoning in vision-centric tasks to prevent unnecessary biases, and (iii) shifting vision-text training paradigms toward fostering models' ability to independently structure and infer relationships among visual cues.
