

ChineseHarm-Bench: A Chinese Harmful Content Detection Benchmark

June 12, 2025
Authors: Kangwei Liu, Siyuan Cheng, Bozhong Tian, Xiaozhuan Liang, Yuyang Yin, Meng Han, Ningyu Zhang, Bryan Hooi, Xi Chen, Shumin Deng
cs.AI

Abstract

Large language models (LLMs) have been increasingly applied to automated harmful content detection tasks, assisting moderators in identifying policy violations and improving the overall efficiency and accuracy of content review. However, existing resources for harmful content detection are predominantly focused on English, with Chinese datasets remaining scarce and often limited in scope. We present a comprehensive, professionally annotated benchmark for Chinese content harm detection, which covers six representative categories and is constructed entirely from real-world data. Our annotation process further yields a knowledge rule base that provides explicit expert knowledge to assist LLMs in Chinese harmful content detection. In addition, we propose a knowledge-augmented baseline that integrates both human-annotated knowledge rules and implicit knowledge from large language models, enabling smaller models to achieve performance comparable to state-of-the-art LLMs. Code and data are available at https://github.com/zjunlp/ChineseHarm-bench.
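The knowledge-augmented baseline described above injects explicit, human-annotated rules into the model's context alongside the text to be classified. The paper's actual rule base and prompt format are in the linked repository; the sketch below is only a minimal illustration of the general pattern, with invented category names and rule wording.

```python
# Hypothetical sketch of rule-augmented prompting for harmful content
# detection. RULE_BASE and the prompt wording are illustrative inventions,
# not the benchmark's actual rules or format.
RULE_BASE = {
    "gambling": "Promotes or solicits participation in betting or lotteries.",
    "fraud": "Impersonates officials or promises fake rewards to obtain money.",
}

def build_prompt(text: str, rules: dict) -> str:
    """Assemble a detection prompt that supplies explicit expert rules
    before asking the model to classify the input text."""
    rule_lines = "\n".join(f"- {cat}: {desc}" for cat, desc in rules.items())
    categories = list(rules) + ["non-harmful"]
    return (
        "You are a Chinese harmful-content detector.\n"
        "Expert knowledge rules:\n"
        f"{rule_lines}\n\n"
        f"Classify the following text into one of {categories}:\n"
        f"{text}"
    )

prompt = build_prompt("示例文本", RULE_BASE)
```

The resulting prompt would then be sent to an LLM (or used to fine-tune a smaller model, as the paper's baseline does); the point of the pattern is that explicit rules let smaller models compensate for weaker implicit knowledge.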