Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans?

Abstract

Vision-Language Models (VLMs) are trained on vast amounts of data captured by humans, emulating our understanding of the world. However, human perception of reality is not always faithful to the physical world, a phenomenon known as visual illusions. This raises a key question: do VLMs experience the same kinds of illusions as humans do, or do they learn to represent reality faithfully? To investigate this question, we build a dataset containing five types of visual illusions and formulate four tasks to examine visual illusions in state-of-the-art VLMs. Our findings show that although the overall alignment with human perception is low, larger models are closer to human perception and more susceptible to visual illusions. Our dataset and initial findings will promote a better understanding of visual illusions in humans and machines and provide a stepping stone for future computational models that can better align humans and machines in perceiving and communicating about the shared visual world. The code and data are available at https://github.com/vl-illusion/dataset.
