Lifelong Safety Alignment for Language Models
May 26, 2025
Authors: Haoyu Wang, Zeyu Qin, Yifei Zhao, Chao Du, Min Lin, Xueqian Wang, Tianyu Pang
cs.AI
Abstract
LLMs have made impressive progress, but their growing capabilities also
expose them to highly flexible jailbreaking attacks designed to bypass safety
alignment. While many existing defenses focus on known types of attacks, it is
more critical to prepare LLMs for unseen attacks that may arise during
deployment. To address this, we propose a lifelong safety alignment framework
that enables LLMs to continuously adapt to new and evolving jailbreaking
strategies. Our framework introduces a competitive setup between two
components: a Meta-Attacker, trained to actively discover novel jailbreaking
strategies, and a Defender, trained to resist them. To effectively warm up the
Meta-Attacker, we first leverage the GPT-4o API to extract key insights from a
large collection of jailbreak-related research papers. Through iterative
training, the first-iteration Meta-Attacker achieves a 73% attack success rate
(ASR) on RR and a 57% transfer ASR on LAT using only single-turn attacks.
Meanwhile, the Defender progressively improves its robustness and ultimately
reduces the Meta-Attacker's success rate to just 7%, enabling safer and more
reliable deployment of LLMs in open-ended environments. The code is available
at https://github.com/sail-sg/LifelongSafetyAlignment.
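
To make the described loop concrete, here is a minimal Python sketch of the two stages the abstract outlines: warming up the Meta-Attacker by extracting attack insights from papers via the GPT-4o API, then iterating the Meta-Attacker/Defender competition. Everything below is illustrative under assumed interfaces: the `MetaAttacker`/`Defender` objects, their `generate`, `is_jailbroken`, and `finetune*` methods, and the extraction prompt are hypothetical stand-ins, not the released implementation in the linked repository.

```python
# Hypothetical sketch of the lifelong safety alignment loop; the object
# interfaces below are assumptions, not the authors' actual code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def extract_insights(paper_text: str) -> str:
    """Warm-up step: ask GPT-4o to distill jailbreak strategies from a paper."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                "Summarize the jailbreaking strategies proposed in this "
                "paper as a list of reusable attack insights:\n\n" + paper_text
            ),
        }],
    )
    return resp.choices[0].message.content


def lifelong_alignment(meta_attacker, defender, harmful_goals, n_iters=5):
    """Iteratively pit the Meta-Attacker against the Defender.

    `meta_attacker` and `defender` are placeholder objects assumed to wrap
    trainable LLMs; their methods are hypothetical.
    """
    for it in range(n_iters):
        # 1) Meta-Attacker proposes a jailbreak prompt for each harmful goal.
        attacks = [meta_attacker.generate(goal) for goal in harmful_goals]
        # 2) Record which attacks bypass the current Defender.
        successes = [a for a in attacks if defender.is_jailbroken(a)]
        asr = len(successes) / len(attacks)
        print(f"iteration {it}: ASR = {asr:.0%}")
        # 3) Reinforce the Meta-Attacker on successful strategies, and
        #    train the Defender to refuse those same attacks, so each side
        #    adapts to the other across iterations.
        meta_attacker.finetune(successes)
        defender.finetune_refusals(successes)
```

Under this reading, the ASR printed each iteration is the quantity the abstract tracks: it starts high after the warm-up (e.g., 73% on RR) and falls as the Defender adapts (to 7% in the final iteration).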