ChineseHarm-Bench: A Chinese Harmful Content Detection Benchmark

Abstract

Large language models (LLMs) have been increasingly applied to automated harmful content detection tasks, assisting moderators in identifying policy violations and improving the overall efficiency and accuracy of content review. However, existing resources for harmful content detection are predominantly focused on English, with Chinese datasets remaining scarce and often limited in scope. We present a comprehensive, professionally annotated benchmark for Chinese harmful content detection, which covers six representative categories and is constructed entirely from real-world data. Our annotation process further yields a knowledge rule base that provides explicit expert knowledge to assist LLMs in Chinese harmful content detection. In addition, we propose a knowledge-augmented baseline that integrates both human-annotated knowledge rules and implicit knowledge from large language models, enabling smaller models to achieve performance comparable to state-of-the-art LLMs. Code and data are available at https://github.com/zjunlp/ChineseHarm-bench.
