
VLM^2-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues

February 17, 2025
Authors: Jianshu Zhang, Dongyu Yao, Renjie Pi, Paul Pu Liang, Yi R. Fung
cs.AI

Abstract

Visually linking matching cues is a crucial ability in daily life, such as identifying the same person in multiple photos based on their cues, even without knowing who they are. Despite the extensive knowledge that vision-language models (VLMs) possess, it remains largely unexplored whether they are capable of performing this fundamental task. To address this, we introduce VLM^2-Bench, a benchmark designed to assess whether VLMs can Visually Link Matching cues, with 9 subtasks and over 3,000 test cases. Comprehensive evaluation across eight open-source VLMs and GPT-4o, along with further analysis of various language-side and vision-side prompting methods, leads to a total of eight key findings. We identify critical challenges in models' ability to link visual cues, highlighting a significant performance gap where even GPT-4o lags 34.80% behind humans. Based on these insights, we advocate for (i) enhancing core visual capabilities to improve adaptability and reduce reliance on prior knowledge, (ii) establishing clearer principles for integrating language-based reasoning in vision-centric tasks to prevent unnecessary biases, and (iii) shifting vision-text training paradigms toward fostering models' ability to independently structure and infer relationships among visual cues.
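To make the kind of query the benchmark poses more concrete, below is a minimal sketch of how a single visual-cue-linking test (e.g., "do these two photos show the same person?") might be sent to GPT-4o. This is an illustration only, assuming the OpenAI Python SDK; the file names, prompt wording, and yes/no answer format are hypothetical and are not taken from the benchmark's released evaluation code.

```python
# Illustrative sketch (not the official VLM^2-Bench harness): posing a
# person-matching query to GPT-4o via the OpenAI Python SDK. The image
# paths and prompt wording are hypothetical placeholders.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def encode_image(path: str) -> str:
    """Return a base64 data URL for a local image file."""
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()


def same_person(image_a: str, image_b: str) -> str:
    """Ask the model whether two photos show the same person."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Do these two photos show the same person? Answer Yes or No."},
                {"type": "image_url", "image_url": {"url": encode_image(image_a)}},
                {"type": "image_url", "image_url": {"url": encode_image(image_b)}},
            ],
        }],
    )
    return response.choices[0].message.content


print(same_person("photo_1.jpg", "photo_2.jpg"))  # hypothetical file names
```

Scoring such a query reduces to comparing the model's yes/no answer against the ground-truth pairing, which is how a cue-linking subtask can be evaluated without the model ever needing to know the person's identity.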

