WildGuard：針對安全風險、越獄和LLM拒絕的一站式開放式監管工具

摘要

我們介紹 WildGuard — 一個針對 LLM 安全性的開放、輕量級審查工具，實現三個目標：(1) 辨識使用者提示中的惡意意圖，(2) 檢測模型回應的安全風險，以及(3) 確定模型的拒絕率。通過 WildGuard，滿足了自動安全性審查和 LLM 互動評估日益增長的需求，提供了一站式工具，具有增強的準確性，並在 13個風險類別中提供廣泛覆蓋。雖然現有的開放式審查工具如 Llama-Guard2 在分類直接模型互動方面表現良好，但在識別對抗性越獄和評估模型拒絕方面遠遠落後於提示的 GPT-4，後者是評估模型回應安全行為的關鍵指標。為應對這些挑戰，我們構建了 WildGuardMix，一個大規模且精心平衡的多任務安全性審查數據集，包含 92K 個標記範例，涵蓋了原始（直接）提示和對抗性越獄，並配對各種拒絕和遵從回應。WildGuardMix 是 WildGuard 的訓練數據 WildGuardTrain 和高質量的人工標記審查測試集 WildGuardTest 的結合，後者包含 5K 個標記項目，涵蓋廣泛的風險情境。通過對 WildGuardTest 和十個現有公開基準的廣泛評估，我們展示了 WildGuard 在開源安全性審查中在所有三個任務上的最新表現，相較於十個強大的現有開源審查模型（例如，拒絕檢測提升高達 26.4%）。重要的是，WildGuard 與 GPT-4 的表現相匹敵，有時甚至超越（例如，提示有害性識別提升高達 3.9%）。WildGuard 在 LLM 介面中作為高效的安全性審查員，將越獄攻擊的成功率從 79.8% 降低至 2.4%。

English

We introduce WildGuard -- an open, light-weight moderation tool for LLM safety that achieves three goals: (1) identifying malicious intent in user prompts, (2) detecting safety risks of model responses, and (3) determining model refusal rate. Together, WildGuard serves the increasing needs for automatic safety moderation and evaluation of LLM interactions, providing a one-stop tool with enhanced accuracy and broad coverage across 13 risk categories. While existing open moderation tools such as Llama-Guard2 score reasonably well in classifying straightforward model interactions, they lag far behind a prompted GPT-4, especially in identifying adversarial jailbreaks and in evaluating models' refusals, a key measure for evaluating safety behaviors in model responses. To address these challenges, we construct WildGuardMix, a large-scale and carefully balanced multi-task safety moderation dataset with 92K labeled examples that cover vanilla (direct) prompts and adversarial jailbreaks, paired with various refusal and compliance responses. WildGuardMix is a combination of WildGuardTrain, the training data of WildGuard, and WildGuardTest, a high-quality human-annotated moderation test set with 5K labeled items covering broad risk scenarios. Through extensive evaluations on WildGuardTest and ten existing public benchmarks, we show that WildGuard establishes state-of-the-art performance in open-source safety moderation across all the three tasks compared to ten strong existing open-source moderation models (e.g., up to 26.4% improvement on refusal detection). Importantly, WildGuard matches and sometimes exceeds GPT-4 performance (e.g., up to 3.9% improvement on prompt harmfulness identification). WildGuard serves as a highly effective safety moderator in an LLM interface, reducing the success rate of jailbreak attacks from 79.8% to 2.4%.

WildGuard：針對安全風險、越獄和LLM拒絕的一站式開放式監管工具

WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

摘要

Support