ShieldGemma: Generative AI Content Moderation Based on Gemma

Abstract

We present ShieldGemma, a comprehensive suite of LLM-based safety content moderation models built upon Gemma2. These models provide robust, state-of-the-art predictions of safety risks across key harm types (sexually explicit content, dangerous content, harassment, and hate speech) in both user input and LLM-generated output. Evaluating on both public and internal benchmarks, we demonstrate superior performance compared to existing models such as Llama Guard (+10.8% AU-PRC on public benchmarks) and WildGuard (+4.3%). Additionally, we present a novel LLM-based data curation pipeline that is adaptable to a variety of safety-related tasks and beyond. We show strong generalization performance for models trained mainly on synthetic data. By releasing ShieldGemma, we provide a valuable resource to the research community, advancing LLM safety and enabling developers to create more effective content moderation solutions.
