Evaluating Large Language Models for Detecting Antisemitism
September 22, 2025
Authors: Jay Patel, Hrudayangam Mehta, Jeremy Blackburn
cs.AI
Abstract
Detecting hateful content is a challenging and important problem. Automated
tools, like machine-learning models, can help, but they require continuous
training to adapt to the ever-changing landscape of social media. In this work,
we evaluate eight open-source LLMs' capability to detect antisemitic content,
specifically leveraging an in-context definition as a policy guideline. We explore
various prompting techniques and design a new CoT-like prompt, Guided-CoT.
Guided-CoT handles the in-context policy well, increasing performance across
all evaluated models, regardless of decoding configuration, model size, or
reasoning capability. Notably, Llama 3.1 70B outperforms fine-tuned GPT-3.5.
Additionally, we examine LLM errors and introduce metrics to quantify semantic
divergence in model-generated rationales, revealing notable differences and
paradoxical behaviors among LLMs. Our experiments highlight differences across
LLMs in utility, explainability, and reliability.
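
The abstract does not include an implementation, but the two key ideas (wrapping an in-context policy definition in a Guided-CoT-style prompt, and quantifying semantic divergence among model-generated rationales) can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the function names (guided_cot_prompt, rationale_divergence), the step wording, and the sentence-transformers embedding backend are not taken from the paper.

```python
# Illustrative sketch, not the paper's implementation: (a) compose a
# Guided-CoT-style prompt around an in-context policy definition, and
# (b) score semantic divergence among rationales as the mean pairwise
# cosine distance between their sentence embeddings.
import itertools

import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding backend

POLICY = "…"  # the in-context definition of antisemitism used as the policy guideline


def guided_cot_prompt(post: str) -> str:
    """Walk the model through policy-guided reasoning steps (wording is hypothetical)."""
    return (
        f"Policy definition:\n{POLICY}\n\n"
        f"Post:\n{post}\n\n"
        "Step 1: Identify who or what the post targets.\n"
        "Step 2: Check each clause of the policy definition against the post.\n"
        "Step 3: Decide whether the post is antisemitic and justify briefly.\n"
    )


def rationale_divergence(rationales: list[str]) -> float:
    """Mean pairwise cosine distance between rationale embeddings (higher = more divergent)."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder would do
    emb = encoder.encode(rationales, normalize_embeddings=True)
    dists = [
        1.0 - float(np.dot(emb[i], emb[j]))
        for i, j in itertools.combinations(range(len(emb)), 2)
    ]
    return float(np.mean(dists))
```

As a usage sketch, one could send guided_cot_prompt(post) to each of the eight models, collect their free-text rationales, and compare rationale_divergence scores across models or decoding configurations to surface the kinds of disagreements the abstract describes.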