Lifelong Safety Alignment for Language Models

Abstract

LLMs have made impressive progress, but their growing capabilities also expose them to highly flexible jailbreaking attacks designed to bypass safety alignment. While many existing defenses focus on known types of attacks, it is more critical to prepare LLMs for unseen attacks that may arise during deployment. To address this, we propose a lifelong safety alignment framework that enables LLMs to continuously adapt to new and evolving jailbreaking strategies. Our framework introduces a competitive setup between two components: a Meta-Attacker, trained to actively discover novel jailbreaking strategies, and a Defender, trained to resist them. To warm up the Meta-Attacker effectively, we first use the GPT-4o API to extract key insights from a large collection of jailbreak-related research papers. Through iterative training, the first-iteration Meta-Attacker achieves a 73% attack success rate (ASR) on RR and a 57% transfer ASR on LAT using only single-turn attacks. Meanwhile, the Defender progressively improves its robustness and ultimately reduces the Meta-Attacker's success rate to just 7%, enabling safer and more reliable deployment of LLMs in open-ended environments. The code is available at https://github.com/sail-sg/LifelongSafetyAlignment.
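As a rough illustration of the attacker/defender dynamic and of how an attack success rate is computed, the toy sketch below may help. It is not the authors' training procedure: the real framework trains language models, whereas here the "Defender" is just a set of blocked strategy names and the "Meta-Attacker" is a fixed list of proposals per round. The function names `attack_success_rate` and `run_rounds` are hypothetical and chosen only for this example.

```python
from typing import Iterable, List, Set


def attack_success_rate(outcomes: Iterable[bool]) -> float:
    """Return the fraction of attack attempts that succeeded (the ASR)."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one attack attempt")
    return sum(outcomes) / len(outcomes)


def run_rounds(attacker_rounds: List[List[str]]) -> List[float]:
    """Toy alternating loop (assumption: strategies are plain strings).

    Each round, the attacker proposes strategies. A strategy succeeds
    if the defender has not yet learned to block it. After the round,
    the defender "trains" by blocking every strategy that succeeded.
    """
    blocked: Set[str] = set()  # stand-in for the Defender's learned refusals
    history: List[float] = []
    for proposals in attacker_rounds:
        outcomes = [strategy not in blocked for strategy in proposals]
        history.append(attack_success_rate(outcomes))
        # Defender update: block the strategies that got through.
        blocked.update(s for s, ok in zip(proposals, outcomes) if ok)
    return history


if __name__ == "__main__":
    rounds = [
        ["a", "b", "c", "d"],  # all novel
        ["a", "b", "e", "f"],  # two reused, two novel
        ["a", "e", "f", "b"],  # nothing novel
    ]
    print(run_rounds(rounds))
```

In this toy setting the ASR falls only when the attacker stops finding new strategies, which is why the framework described above trains the Meta-Attacker to keep discovering novel ones while the Defender adapts.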
