Lifelong Safety Alignment for Language Models

Abstract

LLMs have made impressive progress, but their growing capabilities also expose them to highly flexible jailbreaking attacks designed to bypass safety alignment. Many existing defenses focus on known attack types, yet it is more critical to prepare LLMs for unseen attacks that may arise during deployment. To address this, we propose a lifelong safety alignment framework that enables LLMs to continuously adapt to new and evolving jailbreaking strategies. The framework sets up a competition between two components: a Meta-Attacker, trained to actively discover novel jailbreaking strategies, and a Defender, trained to resist them. To warm up the Meta-Attacker effectively, we first use the GPT-4o API to extract key insights from a large collection of jailbreak-related research papers. Through iterative training, the first-iteration Meta-Attacker achieves a 73% attack success rate (ASR) on RR and a 57% transfer ASR on LAT using only single-turn attacks. Meanwhile, the Defender progressively improves its robustness and ultimately reduces the Meta-Attacker's success rate to just 7%, enabling safer and more reliable deployment of LLMs in open-ended environments. The code is available at https://github.com/sail-sg/LifelongSafetyAlignment.
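
The competitive setup described in the abstract can be illustrated with a toy loop. This is only a minimal sketch under strong assumptions, not the authors' implementation: `ToyMetaAttacker`, `ToyDefender`, `lifelong_loop`, and the placeholder strategy names are all hypothetical, the "models" are simple set lookups rather than trained LLMs, and ASR is assumed to be the fraction of attempted attacks that succeed.

```python
from dataclasses import dataclass, field


def attack_success_rate(outcomes):
    """ASR = successful attacks / total attacks (0.0 if nothing was attempted)."""
    outcomes = list(outcomes)
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


@dataclass
class ToyDefender:
    """Hypothetical stand-in for the Defender: resists strategies it was trained on."""
    known: set = field(default_factory=set)

    def resists(self, strategy):
        return strategy in self.known

    def train_on(self, strategies):
        self.known.update(strategies)


@dataclass
class ToyMetaAttacker:
    """Hypothetical stand-in for the Meta-Attacker: proposes strategies from a fixed pool."""
    pool: list
    budget: int = 4  # number of strategies proposed per iteration (assumed)
    failed: set = field(default_factory=set)

    def propose(self):
        # Prefer strategies that have not yet been observed to fail.
        fresh = [s for s in self.pool if s not in self.failed]
        stale = [s for s in self.pool if s in self.failed]
        return (fresh + stale)[: self.budget]

    def learn_from(self, failures):
        self.failed.update(failures)


def lifelong_loop(attacker, defender, iterations):
    """Alternate attack and defense updates; return the ASR measured in each iteration."""
    history = []
    for _ in range(iterations):
        attacks = attacker.propose()
        outcomes = [not defender.resists(s) for s in attacks]
        history.append(attack_success_rate(outcomes))
        # The Defender adapts to successful attacks; the attacker notes what failed.
        defender.train_on(s for s, ok in zip(attacks, outcomes) if ok)
        attacker.learn_from(s for s, ok in zip(attacks, outcomes) if not ok)
    return history


if __name__ == "__main__":
    pool = [f"strategy-{i}" for i in range(1, 7)]  # placeholder names only
    attacker = ToyMetaAttacker(pool=pool)
    defender = ToyDefender(known={"strategy-1"})
    print(lifelong_loop(attacker, defender, iterations=4))
```

In this toy setting the per-iteration ASR falls as the Defender absorbs each successful strategy, which mirrors the qualitative trend the abstract reports. The numbers it prints say nothing about the paper's actual results.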
