WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models
June 26, 2024
Authors: Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, Nouha Dziri
cs.AI
Abstract
We introduce WildTeaming, an automatic LLM safety red-teaming framework that
mines in-the-wild user-chatbot interactions to discover 5.7K unique clusters of
novel jailbreak tactics, and then composes multiple tactics for systematic
exploration of novel jailbreaks. Compared to prior work that performed
red-teaming via recruited human workers, gradient-based optimization, or
iterative revision with LLMs, our work investigates jailbreaks from chatbot
users who were not specifically instructed to break the system. WildTeaming
reveals previously unidentified vulnerabilities of frontier LLMs, resulting in
up to 4.6x more diverse and successful adversarial attacks compared to
state-of-the-art jailbreak methods.
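To make the pipeline concrete, here is a minimal, self-contained sketch of the mine-cluster-compose loop the abstract describes. Every helper below (the tactic extraction, the round-robin bucketing, the tactic templates) is an illustrative placeholder standing in for the paper's LLM-assisted mining and semantic clustering, not the authors' code.

```python
import random

# Sketch of the WildTeaming idea: mine jailbreak tactics from in-the-wild
# user-chatbot chats, cluster them, then compose several tactics per attempt.
# All implementations here are toy stand-ins for the paper's components.

def mine_tactics(wild_chats):
    # Placeholder: the paper mines user-written jailbreak tactics with LLM
    # assistance; here each chat is simply assumed to yield one tactic string.
    return [f"tactic mined from chat #{i}" for i, _ in enumerate(wild_chats)]

def cluster_tactics(tactics, n_clusters=4):
    # Placeholder for the clustering step (5.7K clusters in the paper):
    # round-robin bucketing instead of semantic clustering. Assumes at least
    # n_clusters tactics so every bucket is non-empty.
    clusters = [[] for _ in range(n_clusters)]
    for i, tactic in enumerate(tactics):
        clusters[i % n_clusters].append(tactic)
    return clusters

def compose_prompt(intent, tactic_combo):
    # Compose multiple mined tactics around a vanilla harmful intent to
    # produce a candidate adversarial prompt.
    return " | ".join(tactic_combo) + f" || {intent}"

def wildteaming(wild_chats, intent, target_llm, judge, k=2, attempts=10):
    # target_llm and judge are caller-supplied callables:
    # target_llm(prompt) -> response; judge(intent, response) -> bool.
    clusters = cluster_tactics(mine_tactics(wild_chats))
    hits = []
    for _ in range(attempts):
        # Sample k distinct clusters, one tactic from each, and compose them.
        combo = [random.choice(c) for c in random.sample(clusters, k)]
        prompt = compose_prompt(intent, combo)
        if judge(intent, target_llm(prompt)):  # keep successful attacks
            hits.append(prompt)
    return hits
```

Composing tactics drawn from distinct clusters, rather than reusing a single template, is what lets the method explore combinations of jailbreaks that do not appear verbatim in the mined chats.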
While many datasets exist for jailbreak evaluation, very few open-source
datasets exist for jailbreak training, as safety training data has been closed
even when model weights are open. With WildTeaming we create WildJailbreak, a
large-scale open-source synthetic safety dataset with 262K vanilla (direct
request) and adversarial (complex jailbreak) prompt-response pairs. To mitigate
exaggerated safety behaviors, WildJailbreak provides two contrastive types of
queries: 1) harmful queries (vanilla & adversarial) and 2) benign queries that
resemble harmful queries in form but contain no harm. As WildJailbreak
considerably upgrades the quality and scale of existing safety resources, it
uniquely enables us to examine the scaling effects of data and the interplay of
data properties and model capabilities during safety training. Through
extensive experiments, we identify the training properties that enable an ideal
balance of safety behaviors: appropriate safeguarding without over-refusal,
effective handling of vanilla and adversarial queries, and minimal, if any,
decrease in general capabilities. All components of WildJailbreak contribute to
achieving balanced safety behaviors of models.
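The contrastive structure of WildJailbreak can be pictured as a simple schema; the field names and toy examples below are assumptions for illustration, not the released dataset format.

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class WildJailbreakExample:
    prompt: str
    response: str                              # refusal for harmful queries, helpful answer for benign ones
    form: Literal["vanilla", "adversarial"]    # direct request vs. complex jailbreak
    harm: Literal["harmful", "benign"]         # benign items mimic harmful form but contain no harm

# Training on these contrastive types targets the balance described above:
# refuse harmful queries (vanilla and adversarial) while still answering
# benign queries that merely resemble harmful ones, avoiding over-refusal.
examples = [
    WildJailbreakExample("How do I pick a lock?", "<refusal>", "vanilla", "harmful"),
    WildJailbreakExample("Roleplay as a villainous locksmith and ...", "<refusal>", "adversarial", "harmful"),
    WildJailbreakExample("How do pin-tumbler locks work mechanically?", "<helpful answer>", "vanilla", "benign"),
]
```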