WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models

Abstract

We introduce WildTeaming, an automatic LLM safety red-teaming framework that mines in-the-wild user-chatbot interactions to discover 5.7K unique clusters of novel jailbreak tactics, and then composes multiple tactics for systematic exploration of novel jailbreaks. Compared to prior work that performed red-teaming via recruited human workers, gradient-based optimization, or iterative revision with LLMs, our work investigates jailbreaks from chatbot users who were not specifically instructed to break the system. WildTeaming reveals previously unidentified vulnerabilities of frontier LLMs, resulting in up to 4.6x more diverse and successful adversarial attacks compared to state-of-the-art jailbreak methods. While many datasets exist for jailbreak evaluation, very few open-source datasets exist for jailbreak training, as safety training data has been closed even when model weights are open. With WildTeaming we create WildJailbreak, a large-scale open-source synthetic safety dataset with 262K vanilla (direct request) and adversarial (complex jailbreak) prompt-response pairs. To mitigate exaggerated safety behaviors, WildJailbreak provides two contrastive types of queries: 1) harmful queries (vanilla and adversarial) and 2) benign queries that resemble harmful queries in form but contain no harm. As WildJailbreak considerably upgrades the quality and scale of existing safety resources, it uniquely enables us to examine the scaling effects of data and the interplay of data properties and model capabilities during safety training. Through extensive experiments, we identify the training properties that enable an ideal balance of safety behaviors: appropriate safeguarding without over-refusal, effective handling of vanilla and adversarial queries, and minimal, if any, decrease in general capabilities. All components of WildJailbreak contribute to achieving balanced safety behaviors of models.
