Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans?
October 31, 2023
Authors: Yichi Zhang, Jiayi Pan, Yuchen Zhou, Rui Pan, Joyce Chai
cs.AI
Abstract
Vision-Language Models (VLMs) are trained on vast amounts of data captured by
humans, emulating our understanding of the world. However, human perception of
reality is not always faithful to the physical world, a phenomenon known as
visual illusions. This raises a key question: do VLMs develop the same kinds of
illusions as humans do, or do they faithfully learn to represent reality? To
investigate this question, we build a dataset containing five types of visual
illusions and formulate four tasks to examine visual illusions in
state-of-the-art VLMs. Our findings show that although the overall alignment
is low, larger models
are closer to human perception and more susceptible to visual illusions. Our
dataset and initial findings will promote a better understanding of visual
illusions in humans and machines and provide a stepping stone for future
computational models that can better align humans and machines in perceiving
and communicating about the shared visual world. The code and data are
available at https://github.com/vl-illusion/dataset.
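To make the probing setup concrete, below is a minimal sketch of how one might query a VLM with an illusion stimulus and a question, in the spirit of the tasks described above. The dataset schema (an `illusions.json` file with `image` and `question` fields) is a hypothetical illustration, not the published format; see the repository above for the actual data. BLIP VQA is used here as a stand-in for the state-of-the-art VLMs evaluated in the paper.

```python
# Sketch: ask a VLM a question about an illusion image.
# Assumes the Hugging Face `transformers` and `Pillow` packages are installed.
import json
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

def ask(image_path: str, question: str) -> str:
    """Return the VLM's free-form answer to a question about one image."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(image, question, return_tensors="pt")
    out = model.generate(**inputs)
    return processor.decode(out[0], skip_special_tokens=True)

# Hypothetical file layout: one entry per illusion stimulus, e.g. an
# Ebbinghaus-style image paired with "Are the two circles the same size?"
with open("illusions.json") as f:
    for item in json.load(f):
        print(item["question"], "->", ask(item["image"], item["question"]))
```

Comparing such answers against both the ground-truth geometry and typical human responses is one way to measure whether a model sides with physical reality or with the human illusion.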