Deceptive Humor: A Synthetic Multilingual Benchmark Dataset for Bridging Fabricated Claims with Humorous Content
March 20, 2025
Authors: Sai Kartheek Reddy Kasu, Shankar Biradar, Sunil Saumya
cs.AI
Abstract
This paper presents the Deceptive Humor Dataset (DHD), a novel resource for
studying humor derived from fabricated claims and misinformation. In an era of
rampant misinformation, understanding how humor intertwines with deception is
essential. DHD consists of humor-infused comments generated with the ChatGPT-4o
model from false narratives, incorporating fabricated claims and manipulated
information. Each instance is labeled with a Satire Level, ranging from 1
(subtle satire) to 3 (high-level satire), and classified into five distinct
Humor Categories: Dark Humor, Irony, Social Commentary, Wordplay, and
Absurdity. The dataset spans multiple languages including English, Telugu,
Hindi, Kannada, Tamil, and their code-mixed variants (Te-En, Hi-En, Ka-En,
Ta-En), making it a valuable multilingual benchmark. By introducing DHD, we
establish a structured foundation for analyzing humor in deceptive contexts,
paving the way for a new research direction that explores how humor not only
interacts with misinformation but also influences its perception and spread. We
establish strong baselines for the proposed dataset, providing a foundation for
future research to benchmark and advance deceptive humor detection models.
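
The labeling scheme described in the abstract can be pictured as a small record type. The following is a minimal sketch only, not the dataset's actual schema: the class name `DHDRecord`, the field names, and the exact strings used to identify languages are assumptions made for illustration. The label values themselves (Satire Levels 1 to 3, the five Humor Categories, and the nine language variants) are taken from the abstract.

```python
from dataclasses import dataclass

# Label spaces as listed in the abstract.
SATIRE_LEVELS = (1, 2, 3)  # 1 = subtle satire, 3 = high-level satire
HUMOR_CATEGORIES = (
    "Dark Humor", "Irony", "Social Commentary", "Wordplay", "Absurdity",
)
# Language identifiers are an assumption; the abstract names the languages
# and the code-mixed variants but not how they are encoded in the data.
LANGUAGES = (
    "English", "Telugu", "Hindi", "Kannada", "Tamil",
    "Te-En", "Hi-En", "Ka-En", "Ta-En",
)


@dataclass(frozen=True)
class DHDRecord:
    """Hypothetical container for one labeled comment (field names assumed)."""
    comment: str
    language: str
    satire_level: int
    humor_category: str

    def __post_init__(self):
        # Reject values outside the label spaces described in the abstract.
        if self.language not in LANGUAGES:
            raise ValueError(f"unknown language: {self.language!r}")
        if self.satire_level not in SATIRE_LEVELS:
            raise ValueError(f"satire level must be 1-3, got {self.satire_level!r}")
        if self.humor_category not in HUMOR_CATEGORIES:
            raise ValueError(f"unknown humor category: {self.humor_category!r}")


# Usage with placeholder text (not a real dataset entry).
example = DHDRecord(
    comment="<placeholder comment text>",
    language="Te-En",
    satire_level=2,
    humor_category="Irony",
)
```

Validating labels at construction time keeps malformed rows out of any downstream baseline experiments; whether the released files need such checks is not stated in the abstract.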