Jailbreaking to Jailbreak
February 9, 2025
Authors: Jeremy Kritz, Vaughn Robinson, Robert Vacareanu, Bijan Varjavand, Michael Choi, Bobby Gogov, Scale Red Team, Summer Yue, Willow E. Primack, Zifan Wang
cs.AI
Abstract
Refusal training on Large Language Models (LLMs) prevents harmful outputs,
yet this defense remains vulnerable to both automated and human-crafted
jailbreaks. We present a novel LLM-as-red-teamer approach in which a human
jailbreaks a refusal-trained LLM to make it willing to jailbreak itself or
other LLMs. We refer to the jailbroken LLMs as J_2 attackers, which can
systematically evaluate target models using various red teaming strategies and
improve their performance via in-context learning from previous failures. Our
experiments demonstrate that Sonnet 3.5 and Gemini 1.5 Pro outperform other
LLMs as J_2, achieving 93.0% and 91.0% attack success rates (ASRs)
respectively against GPT-4o (and similar results across other capable LLMs) on
HarmBench. Our work not only introduces a scalable approach to strategic red
teaming, drawing inspiration from human red teamers, but also highlights
jailbreaking-to-jailbreak as an overlooked failure mode of these safeguards.
Specifically, an LLM can bypass its own safeguards by employing a jailbroken
version of itself that is willing to assist in further jailbreaking. To prevent
direct misuse of J_2 while advancing research in AI safety, we
publicly share our methodology while keeping specific prompting details
private.