Jailbreaking to Jailbreak

Abstract

Refusal training of large language models (LLMs) prevents harmful outputs, yet this defense remains vulnerable to both automated and human-crafted jailbreaks. We present a novel LLM-as-red-teamer approach in which a human jailbreaks a refusal-trained LLM to make it willing to jailbreak itself or other LLMs. We refer to the jailbroken LLMs as J_2 attackers; they can systematically evaluate target models using various red teaming strategies and improve their performance via in-context learning from previous failures. Our experiments demonstrate that Sonnet 3.5 and Gemini 1.5 Pro outperform other LLMs as J_2, achieving attack success rates (ASRs) of 93.0% and 91.0%, respectively, against GPT-4o (with similar results across other capable LLMs) on HarmBench. Our work not only introduces a scalable approach to strategic red teaming, drawing inspiration from human red teamers, but also highlights jailbreaking-to-jailbreak as an overlooked failure mode of the safeguard. Specifically, an LLM can bypass its own safeguards by employing a jailbroken version of itself that is willing to assist in further jailbreaking. To prevent direct misuse of J_2 while advancing research in AI safety, we publicly share our methodology but keep specific prompting details private.
