
Evaluating Large Language Models for Detecting Antisemitism

September 22, 2025
Authors: Jay Patel, Hrudayangam Mehta, Jeremy Blackburn
cs.AI

Abstract

Detecting hateful content is a challenging and important problem. Automated tools, like machine-learning models, can help, but they require continuous training to adapt to the ever-changing landscape of social media. In this work, we evaluate eight open-source LLMs' capability to detect antisemitic content, specifically leveraging in-context definition as a policy guideline. We explore various prompting techniques and design a new CoT-like prompt, Guided-CoT. Guided-CoT handles the in-context policy well, increasing performance across all evaluated models, regardless of decoding configuration, model sizes, or reasoning capability. Notably, Llama 3.1 70B outperforms fine-tuned GPT-3.5. Additionally, we examine LLM errors and introduce metrics to quantify semantic divergence in model-generated rationales, revealing notable differences and paradoxical behaviors among LLMs. Our experiments highlight the differences observed across LLMs' utility, explainability, and reliability.
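The paper's exact Guided-CoT prompt is not reproduced on this page; the sketch below is only an illustration of the idea the abstract describes: a chain-of-thought-style prompt that embeds an in-context definition of antisemitism as a policy guideline and walks the model through fixed reasoning steps before it emits a label. The policy wording, the step list, the function name `build_guided_cot_prompt`, and the label scheme are all assumptions for illustration, not the authors' artifacts.

```python
# Minimal sketch (not the authors' exact Guided-CoT prompt): a CoT-style prompt
# that supplies an in-context definition of antisemitism as a policy guideline
# and asks the model to reason through fixed steps before labeling a post.

# Hypothetical policy text; the paper relies on an in-context definition, but
# this exact wording is an assumption.
POLICY = (
    "Antisemitism is a certain perception of Jews, which may be expressed as "
    "hatred toward Jews, including rhetorical or physical manifestations "
    "directed at Jewish individuals, their property, or community institutions."
)

# Hypothetical guided reasoning steps.
GUIDED_STEPS = [
    "1. Restate the key claim(s) the post makes about Jewish people or institutions.",
    "2. Check each claim against the policy definition above.",
    "3. Consider context such as sarcasm, quotation, or counter-speech.",
    "4. Give a final label, ANTISEMITIC or NOT_ANTISEMITIC, with a one-sentence rationale.",
]

def build_guided_cot_prompt(post: str) -> str:
    """Assemble a guided chain-of-thought prompt for one social-media post."""
    steps = "\n".join(GUIDED_STEPS)
    return (
        f"Policy definition:\n{POLICY}\n\n"
        f"Follow these steps:\n{steps}\n\n"
        f"Post: \"{post}\"\n"
        "Answer:"
    )

if __name__ == "__main__":
    print(build_guided_cot_prompt("example post text here"))
```

The abstract also mentions metrics that quantify semantic divergence in model-generated rationales without spelling them out here. A minimal sketch of one plausible instantiation follows, assuming paired rationales per post and sentence-embedding cosine distance via the `sentence-transformers` library; the embedding model choice and the averaging scheme are assumptions, not the paper's definition.

```python
# Minimal sketch (an assumption, not the paper's exact metric): quantify how
# much two models' rationales for the same posts diverge semantically, using
# mean pairwise cosine distance between sentence embeddings.
import numpy as np
from sentence_transformers import SentenceTransformer

_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def semantic_divergence(rationales_a: list[str], rationales_b: list[str]) -> float:
    """Mean cosine distance between position-paired rationales from two models."""
    emb_a = _encoder.encode(rationales_a, normalize_embeddings=True)
    emb_b = _encoder.encode(rationales_b, normalize_embeddings=True)
    # With normalized embeddings, cosine similarity is a row-wise dot product.
    cos_sim = np.sum(emb_a * emb_b, axis=1)
    return float(np.mean(1.0 - cos_sim))

# Rationales that agree semantically give divergence near 0; unrelated ones approach 1.
print(semantic_divergence(
    ["The post invokes a conspiracy trope about Jewish control."],
    ["The post criticizes a government policy, not a group of people."],
))
```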