

Vision language models are blind

July 9, 2024
Authors: Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, Anh Totti Nguyen
cs.AI

Abstract

Large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini 1.5 Pro, are powering countless image-text applications and scoring high on many vision-understanding benchmarks. Yet, we find that VLMs fail on 7 visual tasks that are absurdly easy for humans, such as identifying (a) whether two circles overlap; (b) whether two lines intersect; (c) which letter is being circled in a word; and (d) counting the number of circles in an Olympic-like logo. The shockingly poor performance of four state-of-the-art VLMs suggests their vision is, at best, like that of a person with myopia who sees fine details as blurry, and, at worst, like an intelligent person who is blind and making educated guesses. Code is available at: https://vlmsareblind.github.io/
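
To make the task setup concrete, the sketch below shows how one such test item (the "do two circles overlap?" question) could be generated together with its ground-truth label. This is a minimal, hypothetical illustration, not the authors' released benchmark code; the function name, canvas layout, and radius are assumptions made for the example.

```python
# Minimal sketch (illustrative only, not the authors' released code) of
# generating one "do two circles overlap?" test image plus its label.
import random
import matplotlib.pyplot as plt
import matplotlib.patches as patches

def make_two_circle_sample(path="two_circles.png", radius=0.1):
    # Randomly place two equal-radius circles on a unit canvas.
    (x1, y1), (x2, y2) = [
        (random.uniform(0.2, 0.8), random.uniform(0.2, 0.8)) for _ in range(2)
    ]
    # Ground truth: the circles overlap iff the distance between centers
    # is less than the sum of the radii.
    dist = ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5
    overlap = dist < 2 * radius

    # Render and save the image that would be shown to the VLM.
    fig, ax = plt.subplots(figsize=(4, 4))
    ax.add_patch(patches.Circle((x1, y1), radius, fill=False, linewidth=2))
    ax.add_patch(patches.Circle((x2, y2), radius, fill=False, linewidth=2))
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.set_aspect("equal")
    ax.axis("off")
    fig.savefig(path, dpi=150, bbox_inches="tight")
    plt.close(fig)
    return overlap  # label to compare against the VLM's yes/no answer

if __name__ == "__main__":
    print("circles overlap:", make_two_circle_sample())
```

Comparing the returned ground-truth label against a model's yes/no answer over many randomly generated images is one way such a simple geometric task can be scored.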

