Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans?

Abstract

Vision-Language Models (VLMs) are trained on vast amounts of data captured by humans, emulating our understanding of the world. However, human perception of reality is not always faithful to the physical world, a phenomenon known as visual illusion. This raises a key question: do VLMs experience the same kinds of illusions as humans do, or do they learn to represent reality faithfully? To investigate this question, we build a dataset containing five types of visual illusions and formulate four tasks to examine visual illusions in state-of-the-art VLMs. Our findings show that although the overall alignment is low, larger models are closer to human perception and more susceptible to visual illusions. Our dataset and initial findings will promote a better understanding of visual illusions in humans and machines and provide a stepping stone for future computational models that can better align humans and machines in perceiving and communicating about the shared visual world. The code and data are available at https://github.com/vl-illusion/dataset.
