X-MuTeST: A Multilingual Benchmark for Explainable Hate Speech Detection and A Novel LLM-consulted Explanation Framework
January 6, 2026
Authors: Mohammad Zia Ur Rehman, Sai Kartheek Reddy Kasu, Shashivardhan Reddy Koppula, Sai Rithwik Reddy Chirra, Shwetank Shekhar Singh, Nagendra Kumar
cs.AI
Abstract
Hate speech detection on social media faces challenges in both accuracy and explainability, especially for underexplored Indic languages. We propose X-MuTeST (eXplainable Multilingual haTe Speech deTection), a novel explainability-guided training framework that combines high-level semantic reasoning from large language models (LLMs) with traditional attention-enhancing techniques. We extend this research to Hindi and Telugu alongside English by providing benchmark human-annotated rationales for each word to justify the assigned class label. The X-MuTeST explainability method computes the difference between the prediction probabilities of the original text and those of unigrams, bigrams, and trigrams. Final explanations are computed as the union of LLM explanations and X-MuTeST explanations. We show that leveraging human rationales during training enhances both classification performance and explainability. Moreover, combining human rationales with our explainability method to refine the model attention yields further improvements. We evaluate explainability using Plausibility metrics such as Token-F1 and IOU-F1 and Faithfulness metrics such as Comprehensiveness and Sufficiency. By focusing on under-resourced languages, our work advances hate speech detection across diverse linguistic contexts. Our dataset includes token-level rationale annotations for 6,004 Hindi, 4,492 Telugu, and 6,334 English samples. Data and code are available at https://github.com/ziarehman30/X-MuTeST
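The n-gram scoring and union step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the probability difference is turned into a rationale, so the sketch assumes each n-gram is classified on its own and kept when its probability is close to that of the full text. The function names, the `threshold` value, the toy classifier, the placeholder token `SLUR`, and the LLM and gold token indices are all hypothetical. A simple Token-F1 computation over token indices is included to show how a predicted rationale might be compared with a human one.

```python
from typing import Callable, Iterator, List, Set, Tuple


def ngram_spans(num_tokens: int, n_max: int = 3) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) index pairs for every unigram, bigram and trigram."""
    for n in range(1, n_max + 1):
        for start in range(num_tokens - n + 1):
            yield start, start + n


def ngram_rationale(
    tokens: List[str],
    predict_proba: Callable[[List[str]], float],
    threshold: float = 0.2,  # assumed cut-off, not taken from the paper
) -> Set[int]:
    """Mark tokens of n-grams whose probability stays close to the full text's."""
    p_full = predict_proba(tokens)
    selected: Set[int] = set()
    for start, end in ngram_spans(len(tokens)):
        p_span = predict_proba(tokens[start:end])
        # Assumption: a small difference means the n-gram alone carries
        # most of the signal that drives the full-text prediction.
        if p_full - p_span <= threshold:
            selected.update(range(start, end))
    return selected


def token_f1(pred: Set[int], gold: Set[int]) -> float:
    """Token-level F1 between predicted and human rationale indices."""
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def toy_predict_proba(tokens: List[str]) -> float:
    """Stand-in classifier: high 'hate' probability if the placeholder appears."""
    return 0.9 if "SLUR" in tokens else 0.1


tokens = "i think those people are SLUR honestly".split()
method_idx = ngram_rationale(tokens, toy_predict_proba)
llm_idx = {2, 3}           # hypothetical token indices flagged by an LLM
final_idx = method_idx | llm_idx  # final explanation as the union of both
gold_idx = {2, 3, 5}       # hypothetical human-annotated rationale
score = token_f1(final_idx, gold_idx)
```

In this toy setting every n-gram containing the placeholder token is selected, so neighbouring context tokens are marked as well; a real classifier and a tuned threshold would behave differently.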