HateMirage: An Explainable Multi-Dimensional Dataset for Decoding Faux Hate and Subtle Online Abuse

Abstract

Subtle and indirect hate speech remains an underexplored challenge in online safety research, particularly when harmful intent is embedded within misleading or manipulative narratives. Existing hate speech datasets primarily capture overt toxicity and underrepresent the nuanced ways in which misinformation can incite or normalize hate. To address this gap, we present HateMirage, a novel dataset of Faux Hate comments designed to advance reasoning and explainability research on hate emerging from fake or distorted narratives. The dataset was constructed by identifying widely debunked misinformation claims from fact-checking sources and tracing related YouTube discussions, resulting in 4,530 user comments. Each comment is annotated along three interpretable dimensions: Target (who is affected), Intent (the underlying motivation or goal behind the comment), and Implication (its potential social impact). Unlike prior explainability datasets such as HateXplain and HARE, which offer token-level or single-dimensional reasoning, HateMirage introduces a multi-dimensional explanation framework that captures the interplay between misinformation, harm, and social consequence. We benchmark multiple open-source language models on HateMirage using ROUGE-L F1 and Sentence-BERT similarity to assess explanation coherence. Results suggest that explanation quality may depend more on pretraining diversity and reasoning-oriented data than on model scale alone. By coupling misinformation reasoning with harm attribution, HateMirage establishes a new benchmark for interpretable hate detection and responsible AI research.
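
To make the ROUGE-L F1 part of the evaluation concrete, the following is a minimal sketch of how a generated explanation could be scored against a reference explanation for each of the three dimensions. It rests on several assumptions that the abstract does not confirm: whitespace tokenization with lowercasing, the balanced F1 form of ROUGE-L, per-dimension scoring, and the dictionary keys `target`, `intent`, and `implication`, which are hypothetical names rather than the dataset's actual schema. The Sentence-BERT similarity is not shown because it requires a pretrained embedding model.

```python
from __future__ import annotations

# Hypothetical dimension keys; the released dataset may name its fields differently.
DIMENSIONS = ("target", "intent", "implication")


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 from LCS-based precision and recall.

    Assumes lowercased whitespace tokens; the paper's tokenizer may differ.
    """
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


def score_explanations(reference: dict[str, str], generated: dict[str, str]) -> dict[str, float]:
    """Score each dimension separately (an assumed protocol, not the paper's)."""
    return {dim: rouge_l_f1(reference[dim], generated[dim]) for dim in DIMENSIONS}
```

Scoring each dimension separately, rather than concatenating the three explanations, keeps a strong Target explanation from masking a weak Implication explanation. Whether the authors aggregate scores this way is not stated in the abstract.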
