YesBut: A High-Quality Annotated Multimodal Dataset for evaluating Satire Comprehension capability of Vision-Language Models
September 20, 2024
Authors: Abhilash Nandy, Yash Agarwal, Ashish Patwa, Millon Madhur Das, Aman Bansal, Ankit Raj, Pawan Goyal, Niloy Ganguly
cs.AI
Abstract
Understanding satire and humor is a challenging task even for current
Vision-Language models. In this paper, we propose the challenging tasks of
Satirical Image Detection (detecting whether an image is satirical),
Understanding (generating the reason behind the image being satirical), and
Completion (given one half of the image, selecting the other half from 2 given
options, such that the complete image is satirical) and release a high-quality
dataset YesBut, consisting of 2547 images, 1084 satirical and 1463
non-satirical, containing different artistic styles, to evaluate those tasks.
Each satirical image in the dataset depicts a normal scenario, along with a
conflicting scenario which is funny or ironic. Despite the success of current
Vision-Language Models on multimodal tasks such as Visual QA and Image
Captioning, our benchmarking experiments show that such models perform poorly
on the proposed tasks on the YesBut Dataset in zero-shot settings, with respect to both
automated and human evaluation. Additionally, we release a dataset of
119 real, satirical photographs for further research. The dataset and code are
available at https://github.com/abhi1nandy2/yesbut_dataset.
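To make the numbers in the abstract concrete, the following is a minimal sketch of two reference points for the proposed tasks: the majority-class accuracy implied by the stated class split for Detection, and a plain accuracy score for the two-option Completion task. The record layout (`CompletionItem`) and the function names are hypothetical illustrations, not the schema or code of the released dataset.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

# Image counts as stated in the abstract.
N_SATIRICAL = 1084
N_NON_SATIRICAL = 1463
N_TOTAL = 2547


@dataclass
class CompletionItem:
    """Hypothetical record for one Completion example (assumed layout)."""
    given_half: str            # identifier of the half shown to the model
    options: Tuple[str, str]   # identifiers of the two candidate halves
    answer: int                # 0 or 1: the option that makes the image satirical


def majority_baseline_accuracy(n_pos: int, n_neg: int) -> float:
    """Accuracy of always predicting the more frequent class."""
    return max(n_pos, n_neg) / (n_pos + n_neg)


def completion_accuracy(items: Sequence[CompletionItem],
                        predictions: Sequence[int]) -> float:
    """Fraction of Completion items where the predicted option is correct."""
    if not items or len(items) != len(predictions):
        raise ValueError("items and predictions must be non-empty and equal in length")
    correct = sum(1 for item, pred in zip(items, predictions) if pred == item.answer)
    return correct / len(items)


if __name__ == "__main__":
    # The two class counts add up to the stated total.
    assert N_SATIRICAL + N_NON_SATIRICAL == N_TOTAL
    print(f"Detection majority baseline: "
          f"{majority_baseline_accuracy(N_SATIRICAL, N_NON_SATIRICAL):.3f}")
```

Because Completion offers two options per item, uniform random guessing gives an expected accuracy of 0.5, which is a natural floor against which zero-shot model scores on that task can be read.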