Deceptive Humor: A Synthetic Multilingual Benchmark Dataset for Bridging Fabricated Claims with Humorous Content

Abstract

This paper presents the Deceptive Humor Dataset (DHD), a novel resource for studying humor derived from fabricated claims and misinformation. In an era of rampant misinformation, understanding how humor intertwines with deception is essential. DHD consists of humor-infused comments generated with the ChatGPT-4o model from false narratives that incorporate fabricated claims and manipulated information. Each instance is labeled with a Satire Level, ranging from 1 for subtle satire to 3 for high-level satire, and is classified into one of five Humor Categories: Dark Humor, Irony, Social Commentary, Wordplay, and Absurdity. The dataset covers English, Telugu, Hindi, Kannada, and Tamil, together with the code-mixed variants Te-En, Hi-En, Ka-En, and Ta-En, making it a valuable multilingual benchmark. By introducing DHD, we establish a structured foundation for analyzing humor in deceptive contexts, paving the way for a new research direction that explores how humor not only interacts with misinformation but also influences its perception and spread. We also establish strong baselines for the proposed dataset, providing a foundation for future research to benchmark and advance deceptive humor detection models.
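The labeling scheme described in the abstract can be illustrated with a minimal Python sketch of a single dataset record. This is only an illustration under stated assumptions: the class name `DeceptiveHumorRecord`, its field names, and the language identifiers are hypothetical and are not taken from the released dataset. Only the Satire Level range (1 to 3), the five Humor Categories, and the list of languages come from the abstract.

```python
from dataclasses import dataclass

# Label sets as listed in the abstract.
HUMOR_CATEGORIES = {
    "Dark Humor", "Irony", "Social Commentary", "Wordplay", "Absurdity",
}
# Language identifiers are an assumption; the abstract only names the languages
# and the code-mixed variants (Te-En, Hi-En, Ka-En, Ta-En).
LANGUAGES = {
    "English", "Telugu", "Hindi", "Kannada", "Tamil",
    "Te-En", "Hi-En", "Ka-En", "Ta-En",
}


@dataclass(frozen=True)
class DeceptiveHumorRecord:
    """Hypothetical shape of one DHD instance (field names are assumed)."""
    comment: str          # humor-infused comment text
    language: str         # one of LANGUAGES
    satire_level: int     # 1 (subtle satire) to 3 (high-level satire)
    humor_category: str   # one of HUMOR_CATEGORIES

    def __post_init__(self):
        # Reject labels outside the scheme described in the abstract.
        if self.satire_level not in (1, 2, 3):
            raise ValueError(f"satire_level must be 1, 2, or 3, got {self.satire_level}")
        if self.humor_category not in HUMOR_CATEGORIES:
            raise ValueError(f"unknown humor_category: {self.humor_category!r}")
        if self.language not in LANGUAGES:
            raise ValueError(f"unknown language: {self.language!r}")


# Usage with placeholder text (not an actual dataset entry).
record = DeceptiveHumorRecord(
    comment="<placeholder comment>",
    language="Te-En",
    satire_level=2,
    humor_category="Irony",
)
```

Validating labels at construction time keeps malformed rows out of any downstream classifier; a real loader would need to follow whatever column names the published dataset actually uses.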
