Visual Riddles: a Commonsense and World Knowledge Challenge for Large Vision and Language Models
July 28, 2024
Authors: Nitzan Bitton-Guetta, Aviv Slobodkin, Aviya Maimon, Eliya Habba, Royi Rassin, Yonatan Bitton, Idan Szpektor, Amir Globerson, Yuval Elovici
cs.AI
Abstract
Imagine observing someone scratching their arm; to understand why, additional
context would be necessary. However, spotting a mosquito nearby would
immediately offer a likely explanation for the person's discomfort, thereby
alleviating the need for further information. This example illustrates how
subtle visual cues can challenge our cognitive skills and demonstrates the
complexity of interpreting visual scenarios. To study these skills, we present
Visual Riddles, a benchmark aimed at testing vision and language models on
visual riddles that require commonsense and world knowledge. The benchmark
comprises 400 visual riddles, each featuring a unique image created by one of a
variety of text-to-image models, together with a question, ground-truth answer, textual hint, and
attribution. Human evaluation reveals that existing models lag significantly
behind human performance: humans achieve 82% accuracy, while the
best-performing model, Gemini-Pro-1.5, reaches only 40%. Our benchmark comes
with automatic evaluation tasks
to make assessment scalable. These findings underscore the potential of Visual
Riddles as a valuable resource for enhancing vision and language models'
capabilities in interpreting complex visual scenarios.