ChatPaper.aiChatPaper

WildGuard:针对安全风险、越狱和LLM拒绝的一站式开放式调节工具

WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

June 26, 2024
作者: Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, Nouha Dziri
cs.AI

摘要

我们介绍WildGuard——一款面向LLM安全的开放式、轻量级的审查工具,实现了三个目标:(1)识别用户提示中的恶意意图,(2)检测模型响应的安全风险,以及(3)确定模型拒绝率。WildGuard满足了对自动安全审查和LLM交互评估日益增长的需求,提供了一个一站式工具,具有增强的准确性,并在13个风险类别中提供广泛覆盖。虽然现有的开放式审查工具(如Llama-Guard2)在对直接的模型交互进行分类方面表现相当不错,但在识别对抗性越狱和评估模型拒绝等方面远远落后于GPT-4的提示,这是评估模型响应安全行为的关键指标。 为了解决这些挑战,我们构建了WildGuardMix,这是一个大规模且精心平衡的多任务安全审查数据集,包含92K个标记示例,涵盖了直接提示和对抗性越狱,并配有各种拒绝和遵从响应。WildGuardMix是WildGuard的训练数据WildGuardTrain和一个高质量的人工注释审查测试集WildGuardTest的组合,后者包含5K个标记项目,涵盖广泛的风险场景。通过对WildGuardTest和十个现有公共基准测试的广泛评估,我们展示了WildGuard在开源安全审查中在所有三个任务上与十个强大的现有开源审查模型相比取得了最先进的性能(例如,在拒绝检测上提高了高达26.4%)。重要的是,WildGuard与GPT-4的性能相匹配,有时甚至超过(例如,在提示有害性识别上提高了高达3.9%)。WildGuard在LLM界面中作为高效的安全审查员,将越狱攻击的成功率从79.8%降低到2.4%。
English
We introduce WildGuard -- an open, light-weight moderation tool for LLM safety that achieves three goals: (1) identifying malicious intent in user prompts, (2) detecting safety risks of model responses, and (3) determining model refusal rate. Together, WildGuard serves the increasing needs for automatic safety moderation and evaluation of LLM interactions, providing a one-stop tool with enhanced accuracy and broad coverage across 13 risk categories. While existing open moderation tools such as Llama-Guard2 score reasonably well in classifying straightforward model interactions, they lag far behind a prompted GPT-4, especially in identifying adversarial jailbreaks and in evaluating models' refusals, a key measure for evaluating safety behaviors in model responses. To address these challenges, we construct WildGuardMix, a large-scale and carefully balanced multi-task safety moderation dataset with 92K labeled examples that cover vanilla (direct) prompts and adversarial jailbreaks, paired with various refusal and compliance responses. WildGuardMix is a combination of WildGuardTrain, the training data of WildGuard, and WildGuardTest, a high-quality human-annotated moderation test set with 5K labeled items covering broad risk scenarios. Through extensive evaluations on WildGuardTest and ten existing public benchmarks, we show that WildGuard establishes state-of-the-art performance in open-source safety moderation across all the three tasks compared to ten strong existing open-source moderation models (e.g., up to 26.4% improvement on refusal detection). Importantly, WildGuard matches and sometimes exceeds GPT-4 performance (e.g., up to 3.9% improvement on prompt harmfulness identification). WildGuard serves as a highly effective safety moderator in an LLM interface, reducing the success rate of jailbreak attacks from 79.8% to 2.4%.

Summary

AI-Generated Summary

PDF131November 29, 2024