Vision language models are blind

Abstract

Large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini 1.5 Pro, are powering countless image-text applications and scoring high on many vision-understanding benchmarks. Yet we find that VLMs fail on 7 visual tasks that are absurdly easy for humans, such as identifying (a) whether two circles overlap; (b) whether two lines intersect; (c) which letter is being circled in a word; and (d) how many circles appear in an Olympic-like logo. The shockingly poor performance of four state-of-the-art VLMs suggests that their vision is, at best, like that of a person with myopia who sees fine details as blurry and, at worst, like that of an intelligent person who is blind and making educated guesses. Code is available at: https://vlmsareblind.github.io/
