X-MuTeST: A Multilingual Benchmark for Explainable Hate Speech Detection and A Novel LLM-consulted Explanation Framework

Abstract

Hate speech detection on social media faces challenges in both accuracy and explainability, especially for underexplored Indic languages. We propose X-MuTeST (eXplainable Multilingual haTe Speech deTection), a novel explainability-guided training framework that combines high-level semantic reasoning from large language models (LLMs) with traditional attention-enhancing techniques. We extend this research to Hindi and Telugu alongside English by providing benchmark human-annotated rationales for each word that justify the assigned class label. The X-MuTeST explainability method computes the difference between the prediction probability of the original text and those of its unigrams, bigrams, and trigrams. The final explanation is the union of the LLM explanation and the X-MuTeST explanation. We show that leveraging human rationales during training enhances both classification performance and explainability. Moreover, combining human rationales with our explainability method to refine the model's attention yields further improvements. We evaluate explainability using Plausibility metrics (Token-F1 and IOU-F1) and Faithfulness metrics (Comprehensiveness and Sufficiency). By focusing on under-resourced languages, our work advances hate speech detection across diverse linguistic contexts. Our dataset includes token-level rationale annotations for 6,004 Hindi, 4,492 Telugu, and 6,334 English samples. Data and code are available at https://github.com/ziarehman30/X-MuTeST.
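The n-gram scoring, the union with the LLM explanation, and two of the evaluation metrics can be illustrated with a minimal sketch. This is one possible reading of the description above, not the released implementation. The toy classifier `toy_proba`, the lexicon, the averaging of per-token differences, the `threshold` value, and the hypothetical LLM output `llm_idx` are all assumptions introduced for illustration. Token-F1, Comprehensiveness, and Sufficiency follow their commonly used definitions (overlap F1 against human rationales, and probability drop when the rationale is removed or kept alone).

```python
from typing import Callable, List, Sequence, Set

Tokens = Sequence[str]
ProbaFn = Callable[[Tokens], float]  # returns P(hate) for a token sequence

LEXICON = {"stupid", "idiot"}  # toy lexicon, purely illustrative


def toy_proba(tokens: Tokens) -> float:
    """Stand-in classifier (assumption): high score if a lexicon word occurs."""
    return 0.9 if any(t in LEXICON for t in tokens) else 0.1


def ngram_rationale(tokens: Tokens, proba: ProbaFn,
                    max_n: int = 3, threshold: float = 0.2) -> Set[int]:
    """Select tokens whose covering n-grams nearly reproduce the full prediction.

    For every 1- to max_n-gram, compute p(full text) - p(n-gram alone) and
    credit that difference to each token in the n-gram. A token is selected
    when its mean difference is at most `threshold` (assumed aggregation).
    """
    p_full = proba(tokens)
    deltas: List[List[float]] = [[] for _ in tokens]
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            delta = p_full - proba(tokens[start:start + n])
            for i in range(start, start + n):
                deltas[i].append(delta)
    return {i for i, d in enumerate(deltas) if d and sum(d) / len(d) <= threshold}


def token_f1(pred: Set[int], gold: Set[int]) -> float:
    """F1 between predicted and human rationale token indices."""
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def comprehensiveness(tokens: Tokens, rationale: Set[int], proba: ProbaFn) -> float:
    """Probability drop when the rationale tokens are removed."""
    rest = [t for i, t in enumerate(tokens) if i not in rationale]
    return proba(tokens) - proba(rest)


def sufficiency(tokens: Tokens, rationale: Set[int], proba: ProbaFn) -> float:
    """Probability drop when only the rationale tokens are kept."""
    only = [t for i, t in enumerate(tokens) if i in rationale]
    return proba(tokens) - proba(only)


tokens = ["you", "are", "a", "stupid", "idiot"]
ngram_idx = ngram_rationale(tokens, toy_proba)  # n-gram based explanation
llm_idx = {0, 3}                                # hypothetical LLM-selected tokens
final_idx = ngram_idx | llm_idx                 # union of both explanations
human_idx = {3, 4}                              # hypothetical human rationale

print(sorted(final_idx), round(token_f1(final_idx, human_idx), 3))
```

In this toy setting the n-gram step selects only the two lexicon words, and the union adds the extra token proposed by the hypothetical LLM, which raises recall-oriented coverage at some cost in precision. IOU-F1 is omitted here because it depends on a span-matching threshold that the text above does not specify.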

