大规模野外团队合作:从野外越狱到(对抗性地)更安全的语言模型
WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models
June 26, 2024
作者: Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, Nouha Dziri
cs.AI
摘要
我们介绍了WildTeaming,这是一个自动的LLM安全红队框架,它通过挖掘野外用户-聊天机器人交互来发现5.7K个独特的监狱越狱策略集群,然后组合多种策略以系统地探索新型越狱方法。与之前通过招募人工工作者、基于梯度的优化或LLMs的迭代修订执行红队行动的工作相比,我们的工作从未被明确指示要破解系统的聊天机器人用户中调查越狱。WildTeaming揭示了前沿LLMs的以前未知的漏洞,导致对抗性攻击的多样性和成功性比最先进的越狱方法高出多达4.6倍。
虽然存在许多用于越狱评估的数据集,但很少有用于越狱训练的开源数据集,因为即使模型权重是公开的,安全训练数据也是封闭的。通过WildTeaming,我们创建了WildJailbreak,这是一个大规模的开源合成安全数据集,包含262K个普通(直接请求)和对抗性(复杂越狱)提示-响应对。为了减轻夸大的安全行为,WildJailbreak提供了两种对比类型的查询:1)有害查询(普通和对抗性)和2)类似于有害查询形式但不包含危害的良性查询。由于WildJailbreak显著提升了现有安全资源的质量和规模,它独特地使我们能够检验数据的扩展效应以及数据属性和模型能力在安全训练期间的相互作用。通过广泛的实验,我们确定了使安全行为达到理想平衡的训练属性:适当的保护而不过度拒绝,有效处理普通和对抗性查询,并且在一般能力方面减少最小化,如果有的话。WildJailbreak的所有组件都有助于实现模型的平衡安全行为。
English
We introduce WildTeaming, an automatic LLM safety red-teaming framework that
mines in-the-wild user-chatbot interactions to discover 5.7K unique clusters of
novel jailbreak tactics, and then composes multiple tactics for systematic
exploration of novel jailbreaks. Compared to prior work that performed
red-teaming via recruited human workers, gradient-based optimization, or
iterative revision with LLMs, our work investigates jailbreaks from chatbot
users who were not specifically instructed to break the system. WildTeaming
reveals previously unidentified vulnerabilities of frontier LLMs, resulting in
up to 4.6x more diverse and successful adversarial attacks compared to
state-of-the-art jailbreak methods.
While many datasets exist for jailbreak evaluation, very few open-source
datasets exist for jailbreak training, as safety training data has been closed
even when model weights are open. With WildTeaming we create WildJailbreak, a
large-scale open-source synthetic safety dataset with 262K vanilla (direct
request) and adversarial (complex jailbreak) prompt-response pairs. To mitigate
exaggerated safety behaviors, WildJailbreak provides two contrastive types of
queries: 1) harmful queries (vanilla & adversarial) and 2) benign queries that
resemble harmful queries in form but contain no harm. As WildJailbreak
considerably upgrades the quality and scale of existing safety resources, it
uniquely enables us to examine the scaling effects of data and the interplay of
data properties and model capabilities during safety training. Through
extensive experiments, we identify the training properties that enable an ideal
balance of safety behaviors: appropriate safeguarding without over-refusal,
effective handling of vanilla and adversarial queries, and minimal, if any,
decrease in general capabilities. All components of WildJailbeak contribute to
achieving balanced safety behaviors of models.Summary
AI-Generated Summary