WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

Abstract

We introduce WildGuard, an open, lightweight moderation tool for LLM safety that achieves three goals: (1) identifying malicious intent in user prompts, (2) detecting safety risks of model responses, and (3) determining model refusal rate. Together, these capabilities let WildGuard serve the increasing need for automatic safety moderation and evaluation of LLM interactions, providing a one-stop tool with enhanced accuracy and broad coverage across 13 risk categories. While existing open moderation tools such as Llama-Guard2 score reasonably well in classifying straightforward model interactions, they lag far behind a prompted GPT-4, especially in identifying adversarial jailbreaks and in evaluating models' refusals, a key measure for evaluating safety behaviors in model responses. To address these challenges, we construct WildGuardMix, a large-scale and carefully balanced multi-task safety moderation dataset with 92K labeled examples that cover vanilla (direct) prompts and adversarial jailbreaks, paired with various refusal and compliance responses. WildGuardMix is a combination of WildGuardTrain, the training data of WildGuard, and WildGuardTest, a high-quality human-annotated moderation test set with 5K labeled items covering broad risk scenarios. Through extensive evaluations on WildGuardTest and ten existing public benchmarks, we show that WildGuard establishes state-of-the-art performance in open-source safety moderation across all three tasks compared to ten strong existing open-source moderation models (e.g., up to 26.4% improvement on refusal detection). Importantly, WildGuard matches and sometimes exceeds GPT-4 performance (e.g., up to 3.9% improvement on prompt harmfulness identification). WildGuard serves as a highly effective safety moderator in an LLM interface, reducing the success rate of jailbreak attacks from 79.8% to 2.4%.
