ChineseHarm-Bench: A Chinese Harmful Content Detection Benchmark

Abstract

Large language models (LLMs) are increasingly applied to automated harmful content detection, where they help moderators identify policy violations and improve the efficiency and accuracy of content review. However, existing resources for harmful content detection focus predominantly on English, while Chinese datasets remain scarce and are often limited in scope. We present a comprehensive, professionally annotated benchmark for Chinese harmful content detection that covers six representative categories and is constructed entirely from real-world data. Our annotation process also yields a knowledge rule base that provides explicit expert knowledge to assist LLMs in Chinese harmful content detection. In addition, we propose a knowledge-augmented baseline that integrates human-annotated knowledge rules with implicit knowledge from large language models, enabling smaller models to achieve performance comparable to state-of-the-art LLMs. Code and data are available at https://github.com/zjunlp/ChineseHarm-bench.
