WildGuard:針對安全風險、越獄和LLM拒絕的一站式開放式監管工具
WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
June 26, 2024
作者: Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, Nouha Dziri
cs.AI
摘要
我們介紹 WildGuard — 一個針對 LLM 安全性的開放、輕量級審查工具,實現三個目標:(1) 辨識使用者提示中的惡意意圖,(2) 檢測模型回應的安全風險,以及(3) 確定模型的拒絕率。通過 WildGuard,滿足了自動安全性審查和 LLM 互動評估日益增長的需求,提供了一站式工具,具有增強的準確性,並在 13個風險類別中提供廣泛覆蓋。雖然現有的開放式審查工具如 Llama-Guard2 在分類直接模型互動方面表現良好,但在識別對抗性越獄和評估模型拒絕方面遠遠落後於提示的 GPT-4,後者是評估模型回應安全行為的關鍵指標。
為應對這些挑戰,我們構建了 WildGuardMix,一個大規模且精心平衡的多任務安全性審查數據集,包含 92K 個標記範例,涵蓋了原始(直接)提示和對抗性越獄,並配對各種拒絕和遵從回應。WildGuardMix 是 WildGuard 的訓練數據 WildGuardTrain 和高質量的人工標記審查測試集 WildGuardTest 的結合,後者包含 5K 個標記項目,涵蓋廣泛的風險情境。通過對 WildGuardTest 和十個現有公開基準的廣泛評估,我們展示了 WildGuard 在開源安全性審查中在所有三個任務上的最新表現,相較於十個強大的現有開源審查模型(例如,拒絕檢測提升高達 26.4%)。重要的是,WildGuard 與 GPT-4 的表現相匹敵,有時甚至超越(例如,提示有害性識別提升高達 3.9%)。WildGuard 在 LLM 介面中作為高效的安全性審查員,將越獄攻擊的成功率從 79.8% 降低至 2.4%。
English
We introduce WildGuard -- an open, light-weight moderation tool for LLM
safety that achieves three goals: (1) identifying malicious intent in user
prompts, (2) detecting safety risks of model responses, and (3) determining
model refusal rate. Together, WildGuard serves the increasing needs for
automatic safety moderation and evaluation of LLM interactions, providing a
one-stop tool with enhanced accuracy and broad coverage across 13 risk
categories. While existing open moderation tools such as Llama-Guard2 score
reasonably well in classifying straightforward model interactions, they lag far
behind a prompted GPT-4, especially in identifying adversarial jailbreaks and
in evaluating models' refusals, a key measure for evaluating safety behaviors
in model responses.
To address these challenges, we construct WildGuardMix, a large-scale and
carefully balanced multi-task safety moderation dataset with 92K labeled
examples that cover vanilla (direct) prompts and adversarial jailbreaks, paired
with various refusal and compliance responses. WildGuardMix is a combination of
WildGuardTrain, the training data of WildGuard, and WildGuardTest, a
high-quality human-annotated moderation test set with 5K labeled items covering
broad risk scenarios. Through extensive evaluations on WildGuardTest and ten
existing public benchmarks, we show that WildGuard establishes state-of-the-art
performance in open-source safety moderation across all the three tasks
compared to ten strong existing open-source moderation models (e.g., up to
26.4% improvement on refusal detection). Importantly, WildGuard matches and
sometimes exceeds GPT-4 performance (e.g., up to 3.9% improvement on prompt
harmfulness identification). WildGuard serves as a highly effective safety
moderator in an LLM interface, reducing the success rate of jailbreak attacks
from 79.8% to 2.4%.