Deceptive Humor: A Synthetic Multilingual Benchmark Dataset for Bridging Fabricated Claims with Humorous Content

Abstract

This paper presents the Deceptive Humor Dataset (DHD), a novel resource for studying humor derived from fabricated claims and misinformation. In an era of rampant misinformation, understanding how humor intertwines with deception is essential. DHD consists of humor-infused comments generated with the ChatGPT-4o model from false narratives that incorporate fabricated claims and manipulated information. Each instance is labeled with a Satire Level, ranging from 1 (subtle satire) to 3 (high-level satire), and is classified into one of five Humor Categories: Dark Humor, Irony, Social Commentary, Wordplay, and Absurdity. The dataset covers English, Telugu, Hindi, Kannada, and Tamil, together with their code-mixed variants (Te-En, Hi-En, Ka-En, Ta-En), making it a valuable multilingual benchmark. By introducing DHD, we establish a structured foundation for analyzing humor in deceptive contexts, opening a research direction that examines how humor not only interacts with misinformation but also influences its perception and spread. We also establish strong baselines for the proposed dataset, giving future research a reference point for benchmarking and advancing deceptive humor detection models.
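To make the labeling scheme concrete, the following is a minimal Python sketch of how a single DHD instance could be represented. It is an illustration only: the class name `DHDInstance`, the field names, and the monolingual language tags (`En`, `Te`, `Hi`, `Ka`, `Ta`) are assumptions made for this example and are not taken from the released dataset. Only the Satire Level range, the five Humor Categories, and the code-mixed tags come from the description above.

```python
from dataclasses import dataclass
from enum import Enum


class HumorCategory(Enum):
    """The five Humor Categories named in the abstract."""
    DARK_HUMOR = "Dark Humor"
    IRONY = "Irony"
    SOCIAL_COMMENTARY = "Social Commentary"
    WORDPLAY = "Wordplay"
    ABSURDITY = "Absurdity"


# Code-mixed tags (Te-En, ...) appear in the abstract; the monolingual
# tags (En, Te, ...) are assumed abbreviations for this sketch.
LANGUAGES = frozenset(
    {"En", "Te", "Hi", "Ka", "Ta", "Te-En", "Hi-En", "Ka-En", "Ta-En"}
)


@dataclass(frozen=True)
class DHDInstance:
    """Hypothetical record layout for one humor-infused comment."""
    comment: str                   # the generated comment text
    satire_level: int              # 1 (subtle) to 3 (high-level)
    humor_category: HumorCategory  # one of the five categories
    language: str                  # one of the tags in LANGUAGES

    def __post_init__(self):
        # Reject values outside the label spaces described in the abstract.
        if self.satire_level not in (1, 2, 3):
            raise ValueError("satire_level must be 1, 2, or 3")
        if not isinstance(self.humor_category, HumorCategory):
            raise TypeError("humor_category must be a HumorCategory")
        if self.language not in LANGUAGES:
            raise ValueError(f"unknown language tag: {self.language!r}")


# Usage: a placeholder instance (the comment text is not real dataset content).
example = DHDInstance(
    comment="<placeholder comment>",
    satire_level=2,
    humor_category=HumorCategory.IRONY,
    language="Te-En",
)
```

Under this layout, Satire Level and Humor Category form two separate label spaces (three and five classes respectively), so a detection model could treat them as two independent classification targets over the same comment text.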