Evaluating Large Language Models for Detecting Antisemitism

Abstract

Detecting hateful content is a challenging and important problem. Automated tools such as machine-learning models can help, but they require continuous training to adapt to the ever-changing landscape of social media. In this work, we evaluate the ability of eight open-source large language models (LLMs) to detect antisemitic content, using an in-context definition as a policy guideline. We explore various prompting techniques and design a new chain-of-thought (CoT)-like prompt, Guided-CoT. Guided-CoT handles the in-context policy well and improves performance across all evaluated models, regardless of decoding configuration, model size, or reasoning capability. Notably, Llama 3.1 70B outperforms fine-tuned GPT-3.5. We also examine LLM errors and introduce metrics to quantify semantic divergence in model-generated rationales, revealing notable differences and paradoxical behaviors among LLMs. Our experiments highlight differences across LLMs in utility, explainability, and reliability.

