
Vision language models are blind

July 9, 2024
Authors: Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, Anh Totti Nguyen
cs.AI

Abstract

Large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini 1.5 Pro, are powering countless image-text applications and scoring high on many vision-understanding benchmarks. Yet, we find that VLMs fail on 7 visual tasks that are absurdly easy for humans, such as identifying (a) whether two circles overlap; (b) whether two lines intersect; (c) which letter is being circled in a word; and (d) counting the number of circles in an Olympic-like logo. The shockingly poor performance of four state-of-the-art VLMs suggests their vision is, at best, like that of a person with myopia who sees fine details as blurry, and, at worst, like that of an intelligent person who is blind and makes educated guesses. Code is available at: https://vlmsareblind.github.io/
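
What makes these failures striking is that every task has crisp geometric ground truth. As a minimal sketch of how such a test item could be generated and labeled (this is not the authors' released benchmark code; the function name make_two_circle_sample, the radius, and the canvas layout are illustrative assumptions), the two-circle overlap task in Python might look like:

import math
import random

import matplotlib.pyplot as plt


def make_two_circle_sample(radius=0.1, path="sample.png"):
    """Draw two randomly placed circles and return the ground-truth label."""
    # Sample two circle centers away from the canvas edges.
    (x1, y1), (x2, y2) = [
        (random.uniform(0.2, 0.8), random.uniform(0.2, 0.8)) for _ in range(2)
    ]
    # Ground truth: two circles overlap iff the distance between their
    # centers is at most the sum of their radii (here, 2 * radius).
    overlaps = math.hypot(x2 - x1, y2 - y1) <= 2 * radius

    fig, ax = plt.subplots(figsize=(3, 3))
    for x, y in [(x1, y1), (x2, y2)]:
        ax.add_patch(plt.Circle((x, y), radius, fill=False, linewidth=2))
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.set_aspect("equal")
    ax.axis("off")
    fig.savefig(path, dpi=150)
    plt.close(fig)
    return overlaps

An evaluator could then show the saved image to a VLM, ask "Do the two circles overlap?", and score the answer against the returned label.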
