
ShieldGemma: Generative AI Content Moderation Based on Gemma

July 31, 2024
Authors: Wenjun Zeng, Yuchi Liu, Ryan Mullins, Ludovic Peran, Joe Fernandez, Hamza Harkous, Karthik Narasimhan, Drew Proud, Piyush Kumar, Bhaktipriya Radharapu, Olivia Sturman, Oscar Wahltinez
cs.AI

Abstract

We present ShieldGemma, a comprehensive suite of LLM-based safety content moderation models built upon Gemma2. These models provide robust, state-of-the-art predictions of safety risks across key harm types (sexually explicit content, dangerous content, harassment, and hate speech) in both user input and LLM-generated output. Evaluating on both public and internal benchmarks, we demonstrate superior performance compared to existing models such as Llama Guard (+10.8% AU-PRC on public benchmarks) and WildCard (+4.3%). Additionally, we present a novel LLM-based data curation pipeline that is adaptable to a variety of safety-related tasks and beyond. We also show strong generalization performance for models trained mainly on synthetic data. By releasing ShieldGemma, we provide a valuable resource to the research community, advancing LLM safety and enabling developers to create more effective content moderation solutions.

