WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

Abstract

We introduce WildGuard, an open, lightweight moderation tool for LLM safety that achieves three goals: (1) identifying malicious intent in user prompts, (2) detecting safety risks of model responses, and (3) determining model refusal rate. Taken together, these capabilities let WildGuard serve the increasing need for automatic safety moderation and evaluation of LLM interactions, providing a one-stop tool with enhanced accuracy and broad coverage across 13 risk categories. While existing open moderation tools such as Llama-Guard2 score reasonably well in classifying straightforward model interactions, they lag far behind a prompted GPT-4, especially in identifying adversarial jailbreaks and in evaluating models' refusals, a key measure for evaluating safety behaviors in model responses. To address these challenges, we construct WildGuardMix, a large-scale and carefully balanced multi-task safety moderation dataset with 92K labeled examples that cover vanilla (direct) prompts and adversarial jailbreaks, paired with various refusal and compliance responses. WildGuardMix is a combination of WildGuardTrain, the training data of WildGuard, and WildGuardTest, a high-quality human-annotated moderation test set with 5K labeled items covering broad risk scenarios. Through extensive evaluations on WildGuardTest and ten existing public benchmarks, we show that WildGuard establishes state-of-the-art performance in open-source safety moderation across all three tasks compared to ten strong existing open-source moderation models (e.g., up to 26.4% improvement on refusal detection). Importantly, WildGuard matches and sometimes exceeds GPT-4 performance (e.g., up to 3.9% improvement on prompt harmfulness identification). WildGuard serves as a highly effective safety moderator in an LLM interface, reducing the success rate of jailbreak attacks from 79.8% to 2.4%.
