
VLM^2-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues

February 17, 2025
Authors: Jianshu Zhang, Dongyu Yao, Renjie Pi, Paul Pu Liang, Yi R. Fung
cs.AI

Abstract

Visually linking matching cues is a crucial ability in daily life, such as identifying the same person in multiple photos based on their cues, even without knowing who they are. Despite the extensive knowledge that vision-language models (VLMs) possess, it remains largely unexplored whether they are capable of performing this fundamental task. To address this, we introduce VLM^2-Bench, a benchmark designed to assess whether VLMs can Visually Link Matching cues, with 9 subtasks and over 3,000 test cases. Comprehensive evaluation across eight open-source VLMs and GPT-4o, along with further analysis of various language-side and vision-side prompting methods, leads to a total of eight key findings. We identify critical challenges in models' ability to link visual cues, highlighting a significant performance gap where even GPT-4o lags 34.80% behind humans. Based on these insights, we advocate for (i) enhancing core visual capabilities to improve adaptability and reduce reliance on prior knowledge, (ii) establishing clearer principles for integrating language-based reasoning in vision-centric tasks to prevent unnecessary biases, and (iii) shifting vision-text training paradigms toward fostering models' ability to independently structure and infer relationships among visual cues.
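To make the kind of query the benchmark poses more concrete, below is a minimal sketch of how a single visual-cue-linking test (e.g., "do these two photos show the same person?") might be sent to GPT-4o. This is an illustration only, assuming the OpenAI Python SDK; the file names, prompt wording, and yes/no answer format are hypothetical and are not taken from the benchmark's released evaluation code.

```python
# Illustrative sketch (not the official VLM^2-Bench harness): posing a
# person-matching query to GPT-4o via the OpenAI Python SDK. The image
# paths and prompt wording are hypothetical placeholders.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def encode_image(path: str) -> str:
    """Return a base64 data URL for a local image file."""
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()


def same_person(image_a: str, image_b: str) -> str:
    """Ask the model whether two photos show the same person."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Do these two photos show the same person? Answer Yes or No."},
                {"type": "image_url", "image_url": {"url": encode_image(image_a)}},
                {"type": "image_url", "image_url": {"url": encode_image(image_b)}},
            ],
        }],
    )
    return response.choices[0].message.content


print(same_person("photo_1.jpg", "photo_2.jpg"))  # hypothetical file names
```

Scoring such a query reduces to comparing the model's yes/no answer against the ground-truth pairing, which is how a cue-linking subtask can be evaluated without the model ever needing to know the person's identity.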

