Lifelong Safety Alignment for Language Models

May 26, 2025
Authors: Haoyu Wang, Zeyu Qin, Yifei Zhao, Chao Du, Min Lin, Xueqian Wang, Tianyu Pang
cs.AI

Abstract

LLMs have made impressive progress, but their growing capabilities also expose them to highly flexible jailbreaking attacks designed to bypass safety alignment. While many existing defenses focus on known types of attacks, it is more critical to prepare LLMs for unseen attacks that may arise during deployment. To address this, we propose a lifelong safety alignment framework that enables LLMs to continuously adapt to new and evolving jailbreaking strategies. Our framework introduces a competitive setup between two components: a Meta-Attacker, trained to actively discover novel jailbreaking strategies, and a Defender, trained to resist them. To effectively warm up the Meta-Attacker, we first leverage the GPT-4o API to extract key insights from a large collection of jailbreak-related research papers. Through iterative training, the first-iteration Meta-Attacker achieves a 73% attack success rate (ASR) on RR and a 57% transfer ASR on LAT using only single-turn attacks. Meanwhile, the Defender progressively improves its robustness and ultimately reduces the Meta-Attacker's success rate to just 7%, enabling safer and more reliable deployment of LLMs in open-ended environments. The code is available at https://github.com/sail-sg/LifelongSafetyAlignment.
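The abstract describes an iterative attacker/defender loop: warm up the Meta-Attacker from distilled literature insights, let it propose attacks, train the Defender on the successes, and repeat. The toy Python sketch below illustrates that control flow only; every function name and the ASR/robustness dynamics are hypothetical stand-ins, not the authors' implementation (which is in the linked repository).

```python
# Toy sketch of the lifelong attacker/defender loop from the abstract.
# All names (warm_up_meta_attacker, generate_attacks, ...) are hypothetical
# illustrations; the numbers printed are simulated, not the paper's results.

import random

def warm_up_meta_attacker(papers):
    """Stand-in for the GPT-4o warm-up: distill strategies from papers."""
    return [f"strategy distilled from {p}" for p in papers]

def generate_attacks(strategies, n=4):
    """Meta-Attacker step: sample candidate jailbreaks from known strategies."""
    return [f"attack using '{random.choice(strategies)}'" for _ in range(n)]

def defender_blocks(attack, robustness):
    """Defender step: block the attack with probability = current robustness."""
    return random.random() < robustness

def lifelong_alignment(papers, iterations=5):
    strategies = warm_up_meta_attacker(papers)
    robustness = 0.3  # toy starting robustness of the Defender

    for it in range(iterations):
        attacks = generate_attacks(strategies)
        successes = [a for a in attacks if not defender_blocks(a, robustness)]

        # Meta-Attacker keeps successful attacks as new strategy material.
        strategies += [f"refined {a}" for a in successes]
        # Defender trains on the successful attacks, improving robustness.
        robustness = min(0.95, robustness + 0.1 * len(successes) / len(attacks))

        asr = len(successes) / len(attacks)
        print(f"iteration {it}: ASR={asr:.0%}, defender robustness={robustness:.2f}")

if __name__ == "__main__":
    lifelong_alignment(["paper_1", "paper_2"])
```

Running this prints a per-iteration ASR that tends to fall as the toy Defender's robustness grows, mirroring in miniature the 73% to 7% trend the abstract reports for the real framework.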
