ShieldGemma: Generative AI Content Moderation Based on Gemma
July 31, 2024
Authors: Wenjun Zeng, Yuchi Liu, Ryan Mullins, Ludovic Peran, Joe Fernandez, Hamza Harkous, Karthik Narasimhan, Drew Proud, Piyush Kumar, Bhaktipriya Radharapu, Olivia Sturman, Oscar Wahltinez
cs.AI
Abstract
We present ShieldGemma, a comprehensive suite of LLM-based safety content
moderation models built upon Gemma2. These models provide robust,
state-of-the-art predictions of safety risks across key harm types (sexually
explicit, dangerous content, harassment, hate speech) in both user input and
LLM-generated output. By evaluating on both public and internal benchmarks, we
demonstrate superior performance compared to existing models, such as Llama
Guard (+10.8% AU-PRC on public benchmarks) and WildCard (+4.3%).
Additionally, we present a novel LLM-based data curation pipeline, adaptable to
a variety of safety-related tasks and beyond. We have shown strong
generalization performance for models trained mainly on synthetic data. By
releasing ShieldGemma, we provide a valuable resource to the research
community, advancing LLM safety and enabling the creation of more effective
content moderation solutions for developers.
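The reported gains are measured in AU-PRC, the area under the precision-recall curve. As a minimal illustration of that metric only (a pure-Python sketch on made-up toy data, not the paper's evaluation code or its benchmarks), AU-PRC can be computed by ranking examples by predicted risk score and summing precision over each recall increment:

```python
def au_prc(labels, scores):
    """Area under the precision-recall curve (average precision) for a
    binary classifier: sort examples by descending score, then accumulate
    sum_k (R_k - R_{k-1}) * P_k. Assumes no tied scores for simplicity."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    total_positives = sum(labels)
    tp = fp = 0
    prev_recall = area = 0.0
    for _, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        recall = tp / total_positives
        precision = tp / (tp + fp)
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area

# Toy data: 1 = policy-violating, 0 = benign; scores are predicted risk.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.95, 0.80, 0.60, 0.55, 0.30, 0.10]
print(f"AU-PRC: {au_prc(labels, scores):.3f}")  # → AU-PRC: 0.917
```

Unlike accuracy, this summary is insensitive to the classification threshold, which is why it suits comparing safety classifiers whose operating points differ across deployments.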