Evaluating Large Language Models for Detecting Antisemitism
September 22, 2025
Authors: Jay Patel, Hrudayangam Mehta, Jeremy Blackburn
cs.AI
Abstract
Detecting hateful content is a challenging and important problem. Automated
tools, like machine-learning models, can help, but they require continuous
training to adapt to the ever-changing landscape of social media. In this work,
we evaluate eight open-source LLMs' capability to detect antisemitic content,
specifically leveraging an in-context definition as a policy guideline. We explore
various prompting techniques and design a new CoT-like prompt, Guided-CoT.
Guided-CoT handles the in-context policy well, increasing performance across
all evaluated models, regardless of decoding configuration, model size, or
reasoning capability. Notably, Llama 3.1 70B outperforms fine-tuned GPT-3.5.
Additionally, we examine LLM errors and introduce metrics to quantify semantic
divergence in model-generated rationales, revealing notable differences and
paradoxical behaviors among LLMs. Our experiments highlight differences across
LLMs in utility, explainability, and reliability.
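
The abstract does not include an implementation, but the two key ideas (wrapping an in-context policy definition in a Guided-CoT-style prompt, and quantifying semantic divergence among model-generated rationales) can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the function names (guided_cot_prompt, rationale_divergence), the step wording, and the sentence-transformers embedding backend are not taken from the paper.

```python
# Illustrative sketch, not the paper's implementation: (a) compose a
# Guided-CoT-style prompt around an in-context policy definition, and
# (b) score semantic divergence among rationales as the mean pairwise
# cosine distance between their sentence embeddings.
import itertools

import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding backend

POLICY = "…"  # the in-context definition of antisemitism used as the policy guideline


def guided_cot_prompt(post: str) -> str:
    """Walk the model through policy-guided reasoning steps (wording is hypothetical)."""
    return (
        f"Policy definition:\n{POLICY}\n\n"
        f"Post:\n{post}\n\n"
        "Step 1: Identify who or what the post targets.\n"
        "Step 2: Check each clause of the policy definition against the post.\n"
        "Step 3: Decide whether the post is antisemitic and justify briefly.\n"
    )


def rationale_divergence(rationales: list[str]) -> float:
    """Mean pairwise cosine distance between rationale embeddings (higher = more divergent)."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder would do
    emb = encoder.encode(rationales, normalize_embeddings=True)
    dists = [
        1.0 - float(np.dot(emb[i], emb[j]))
        for i, j in itertools.combinations(range(len(emb)), 2)
    ]
    return float(np.mean(dists))
```

As a usage sketch, one could send guided_cot_prompt(post) to each of the eight models, collect their free-text rationales, and compare rationale_divergence scores across models or decoding configurations to surface the kinds of disagreements the abstract describes.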