Lifelong Safety Alignment for Language Models
May 26, 2025
Authors: Haoyu Wang, Zeyu Qin, Yifei Zhao, Chao Du, Min Lin, Xueqian Wang, Tianyu Pang
cs.AI
Abstract
LLMs have made impressive progress, but their growing capabilities also
expose them to highly flexible jailbreaking attacks designed to bypass safety
alignment. While many existing defenses focus on known types of attacks, it is
more critical to prepare LLMs for unseen attacks that may arise during
deployment. To address this, we propose a lifelong safety alignment framework
that enables LLMs to continuously adapt to new and evolving jailbreaking
strategies. Our framework introduces a competitive setup between two
components: a Meta-Attacker, trained to actively discover novel jailbreaking
strategies, and a Defender, trained to resist them. To effectively warm up the
Meta-Attacker, we first leverage the GPT-4o API to extract key insights from a
large collection of jailbreak-related research papers. Through iterative
training, the first-iteration Meta-Attacker achieves a 73% attack success rate
(ASR) on RR and a 57% transfer ASR on LAT using only single-turn attacks.
Meanwhile, the Defender progressively improves its robustness and ultimately
reduces the Meta-Attacker's success rate to just 7%, enabling safer and more
reliable deployment of LLMs in open-ended environments. The code is available
at https://github.com/sail-sg/LifelongSafetyAlignment.
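
To make the described loop concrete, here is a minimal Python sketch of the two stages the abstract outlines: warming up the Meta-Attacker by extracting attack insights from papers via the GPT-4o API, then iterating the Meta-Attacker/Defender competition. Everything below is illustrative under assumed interfaces: the `MetaAttacker`/`Defender` objects, their `generate`, `is_jailbroken`, and `finetune*` methods, and the extraction prompt are hypothetical stand-ins, not the released implementation in the linked repository.

```python
# Hypothetical sketch of the lifelong safety alignment loop; the object
# interfaces below are assumptions, not the authors' actual code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def extract_insights(paper_text: str) -> str:
    """Warm-up step: ask GPT-4o to distill jailbreak strategies from a paper."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                "Summarize the jailbreaking strategies proposed in this "
                "paper as a list of reusable attack insights:\n\n" + paper_text
            ),
        }],
    )
    return resp.choices[0].message.content


def lifelong_alignment(meta_attacker, defender, harmful_goals, n_iters=5):
    """Iteratively pit the Meta-Attacker against the Defender.

    `meta_attacker` and `defender` are placeholder objects assumed to wrap
    trainable LLMs; their methods are hypothetical.
    """
    for it in range(n_iters):
        # 1) Meta-Attacker proposes a jailbreak prompt for each harmful goal.
        attacks = [meta_attacker.generate(goal) for goal in harmful_goals]
        # 2) Record which attacks bypass the current Defender.
        successes = [a for a in attacks if defender.is_jailbroken(a)]
        asr = len(successes) / len(attacks)
        print(f"iteration {it}: ASR = {asr:.0%}")
        # 3) Reinforce the Meta-Attacker on successful strategies, and
        #    train the Defender to refuse those same attacks, so each side
        #    adapts to the other across iterations.
        meta_attacker.finetune(successes)
        defender.finetune_refusals(successes)
```

Under this reading, the ASR printed each iteration is the quantity the abstract tracks: it starts high after the warm-up (e.g., 73% on RR) and falls as the Defender adapts (to 7% in the final iteration).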