
Evaluating Large Language Models for Detecting Antisemitism

September 22, 2025
Authors: Jay Patel, Hrudayangam Mehta, Jeremy Blackburn
cs.AI

Abstract

Detecting hateful content is a challenging and important problem. Automated tools, like machine-learning models, can help, but they require continuous training to adapt to the ever-changing landscape of social media. In this work, we evaluate eight open-source LLMs' capability to detect antisemitic content, specifically leveraging in-context definition as a policy guideline. We explore various prompting techniques and design a new CoT-like prompt, Guided-CoT. Guided-CoT handles the in-context policy well, increasing performance across all evaluated models, regardless of decoding configuration, model sizes, or reasoning capability. Notably, Llama 3.1 70B outperforms fine-tuned GPT-3.5. Additionally, we examine LLM errors and introduce metrics to quantify semantic divergence in model-generated rationales, revealing notable differences and paradoxical behaviors among LLMs. Our experiments highlight the differences observed across LLMs' utility, explainability, and reliability.
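The paper's exact Guided-CoT prompt is not reproduced on this page; the sketch below is only an illustration of the idea the abstract describes: a chain-of-thought-style prompt that embeds an in-context definition of antisemitism as a policy guideline and walks the model through fixed reasoning steps before it emits a label. The policy wording, the step list, the function name `build_guided_cot_prompt`, and the label scheme are all assumptions for illustration, not the authors' artifacts.

```python
# Minimal sketch (not the authors' exact Guided-CoT prompt): a CoT-style prompt
# that supplies an in-context definition of antisemitism as a policy guideline
# and asks the model to reason through fixed steps before labeling a post.

# Hypothetical policy text; the paper relies on an in-context definition, but
# this exact wording is an assumption.
POLICY = (
    "Antisemitism is a certain perception of Jews, which may be expressed as "
    "hatred toward Jews, including rhetorical or physical manifestations "
    "directed at Jewish individuals, their property, or community institutions."
)

# Hypothetical guided reasoning steps.
GUIDED_STEPS = [
    "1. Restate the key claim(s) the post makes about Jewish people or institutions.",
    "2. Check each claim against the policy definition above.",
    "3. Consider context such as sarcasm, quotation, or counter-speech.",
    "4. Give a final label, ANTISEMITIC or NOT_ANTISEMITIC, with a one-sentence rationale.",
]

def build_guided_cot_prompt(post: str) -> str:
    """Assemble a guided chain-of-thought prompt for one social-media post."""
    steps = "\n".join(GUIDED_STEPS)
    return (
        f"Policy definition:\n{POLICY}\n\n"
        f"Follow these steps:\n{steps}\n\n"
        f"Post: \"{post}\"\n"
        "Answer:"
    )

if __name__ == "__main__":
    print(build_guided_cot_prompt("example post text here"))
```

The abstract also mentions metrics that quantify semantic divergence in model-generated rationales without spelling them out here. A minimal sketch of one plausible instantiation follows, assuming paired rationales per post and sentence-embedding cosine distance via the `sentence-transformers` library; the embedding model choice and the averaging scheme are assumptions, not the paper's definition.

```python
# Minimal sketch (an assumption, not the paper's exact metric): quantify how
# much two models' rationales for the same posts diverge semantically, using
# mean pairwise cosine distance between sentence embeddings.
import numpy as np
from sentence_transformers import SentenceTransformer

_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def semantic_divergence(rationales_a: list[str], rationales_b: list[str]) -> float:
    """Mean cosine distance between position-paired rationales from two models."""
    emb_a = _encoder.encode(rationales_a, normalize_embeddings=True)
    emb_b = _encoder.encode(rationales_b, normalize_embeddings=True)
    # With normalized embeddings, cosine similarity is a row-wise dot product.
    cos_sim = np.sum(emb_a * emb_b, axis=1)
    return float(np.mean(1.0 - cos_sim))

# Rationales that agree semantically give divergence near 0; unrelated ones approach 1.
print(semantic_divergence(
    ["The post invokes a conspiracy trope about Jewish control."],
    ["The post criticizes a government policy, not a group of people."],
))
```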