
Visual Riddles: a Commonsense and World Knowledge Challenge for Large Vision and Language Models

July 28, 2024
作者: Nitzan Bitton-Guetta, Aviv Slobodkin, Aviya Maimon, Eliya Habba, Royi Rassin, Yonatan Bitton, Idan Szpektor, Amir Globerson, Yuval Elovici
cs.AI

Abstract

Imagine observing someone scratching their arm; to understand why, additional context would be necessary. However, spotting a mosquito nearby would immediately offer a likely explanation for the person's discomfort, thereby alleviating the need for further information. This example illustrates how subtle visual cues can challenge our cognitive skills and demonstrates the complexity of interpreting visual scenarios. To study these skills, we present Visual Riddles, a benchmark designed to test vision and language models on visual riddles requiring commonsense and world knowledge. The benchmark comprises 400 visual riddles, each featuring a unique image created by one of a variety of text-to-image models, a question, a ground-truth answer, a textual hint, and an attribution. Human evaluation reveals that existing models lag significantly behind human performance, which stands at 82% accuracy, with Gemini-Pro-1.5 leading among models at 40% accuracy. Our benchmark comes with automatic evaluation tasks to make assessment scalable. These findings underscore the potential of Visual Riddles as a valuable resource for enhancing vision and language models' capabilities in interpreting complex visual scenarios.
