

Vision language models are blind

July 9, 2024
Authors: Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, Anh Totti Nguyen
cs.AI

Abstract

Large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini 1.5 Pro, are powering countless image-text applications and scoring high on many vision-understanding benchmarks. Yet, we find that VLMs fail on 7 visual tasks that are absurdly easy for humans, such as identifying (a) whether two circles overlap; (b) whether two lines intersect; (c) which letter is being circled in a word; and (d) counting the number of circles in an Olympic-like logo. The shockingly poor performance of four state-of-the-art VLMs suggests their vision is, at best, like that of a person with myopia who sees fine details as blurry, and, at worst, like an intelligent person who is blind and making educated guesses. Code is available at: https://vlmsareblind.github.io/
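
To make the task setup concrete, the sketch below shows how one such test item (the "do two circles overlap?" question) could be generated together with its ground-truth label. This is a minimal, hypothetical illustration, not the authors' released benchmark code; the function name, canvas layout, and radius are assumptions made for the example.

```python
# Minimal sketch (illustrative only, not the authors' released code) of
# generating one "do two circles overlap?" test image plus its label.
import random
import matplotlib.pyplot as plt
import matplotlib.patches as patches

def make_two_circle_sample(path="two_circles.png", radius=0.1):
    # Randomly place two equal-radius circles on a unit canvas.
    (x1, y1), (x2, y2) = [
        (random.uniform(0.2, 0.8), random.uniform(0.2, 0.8)) for _ in range(2)
    ]
    # Ground truth: the circles overlap iff the distance between centers
    # is less than the sum of the radii.
    dist = ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5
    overlap = dist < 2 * radius

    # Render and save the image that would be shown to the VLM.
    fig, ax = plt.subplots(figsize=(4, 4))
    ax.add_patch(patches.Circle((x1, y1), radius, fill=False, linewidth=2))
    ax.add_patch(patches.Circle((x2, y2), radius, fill=False, linewidth=2))
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.set_aspect("equal")
    ax.axis("off")
    fig.savefig(path, dpi=150, bbox_inches="tight")
    plt.close(fig)
    return overlap  # label to compare against the VLM's yes/no answer

if __name__ == "__main__":
    print("circles overlap:", make_two_circle_sample())
```

Comparing the returned ground-truth label against a model's yes/no answer over many randomly generated images is one way such a simple geometric task can be scored.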

