

HateMirage: An Explainable Multi-Dimensional Dataset for Decoding Faux Hate and Subtle Online Abuse

March 3, 2026
Authors: Sai Kartheek Reddy Kasu, Shankar Biradar, Sunil Saumya, Md. Shad Akhtar
cs.AI

Abstract

Subtle and indirect hate speech remains an underexplored challenge in online safety research, particularly when harmful intent is embedded within misleading or manipulative narratives. Existing hate speech datasets primarily capture overt toxicity, underrepresenting the nuanced ways misinformation can incite or normalize hate. To address this gap, we present HateMirage, a novel dataset of Faux Hate comments designed to advance reasoning and explainability research on hate emerging from fake or distorted narratives. The dataset was constructed by identifying widely debunked misinformation claims from fact-checking sources and tracing related YouTube discussions, resulting in 4,530 user comments. Each comment is annotated along three interpretable dimensions: Target (who is affected), Intent (the underlying motivation or goal behind the comment), and Implication (its potential social impact). Unlike prior explainability datasets such as HateXplain and HARE, which offer token-level or single-dimensional reasoning, HateMirage introduces a multi-dimensional explanation framework that captures the interplay between misinformation, harm, and social consequence. We benchmark multiple open-source language models on HateMirage using ROUGE-L F1 and Sentence-BERT similarity to assess explanation coherence. Results suggest that explanation quality may depend more on pretraining diversity and reasoning-oriented data than on model scale alone. By coupling misinformation reasoning with harm attribution, HateMirage establishes a new benchmark for interpretable hate detection and responsible AI research.
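The ROUGE-L F1 score mentioned as an evaluation metric compares a model-generated explanation against a reference annotation via their longest common subsequence of tokens. A minimal sketch of that computation is below; the whitespace tokenization and the example sentences are illustrative assumptions, not drawn from the HateMirage data or the paper's exact evaluation pipeline.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists (standard DP)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall.

    Tokenization here is simple lowercased whitespace splitting — an
    assumption for illustration; evaluation toolkits typically apply
    their own normalization.
    """
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


# Hypothetical explanation pair for the "Target" dimension:
score = rouge_l_f1(
    "the comment targets a religious minority",
    "the comment targets a minority group",
)
print(f"{score:.3f}")  # → 0.833 (LCS of 5 tokens over two 6-token strings)
```

Sentence-BERT similarity, the paper's second metric, would complement this by scoring semantic rather than lexical overlap (e.g. via cosine similarity of sentence embeddings from the `sentence-transformers` library), which is why the two are typically reported together.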