Lifelong Safety Alignment for Language Models

Abstract

LLMs have made impressive progress, but their growing capabilities also expose them to highly flexible jailbreaking attacks designed to bypass safety alignment. While many existing defenses focus on known types of attacks, it is more critical to prepare LLMs for unseen attacks that may arise during deployment. To address this, we propose a lifelong safety alignment framework that enables LLMs to continuously adapt to new and evolving jailbreaking strategies. Our framework introduces a competitive setup between two components: a Meta-Attacker, trained to actively discover novel jailbreaking strategies, and a Defender, trained to resist them. To warm up the Meta-Attacker effectively, we first use the GPT-4o API to extract key insights from a large collection of jailbreak-related research papers. In iterative training, the first-iteration Meta-Attacker achieves a 73% attack success rate (ASR) on RR and a 57% transfer ASR on LAT using only single-turn attacks. Meanwhile, the Defender progressively improves its robustness and ultimately reduces the Meta-Attacker's success rate to just 7%, enabling safer and more reliable deployment of LLMs in open-ended environments. Code is available at https://github.com/sail-sg/LifelongSafetyAlignment.
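The competitive setup described in the abstract can be illustrated with a toy simulation. This is only a minimal sketch under strong assumptions, not the authors' implementation: the class names `ToyMetaAttacker` and `ToyDefender`, the `family#id` strategy strings, and the `generalize_after` threshold are all hypothetical, and simple string matching stands in for the LLMs, the jailbreak judge, and the training steps. It shows only the shape of the loop: the attacker proposes strategies, ASR is measured as the fraction of attacks that succeed, and the defender is then trained on the successful attacks.

```python
from __future__ import annotations

from dataclasses import dataclass, field


def attack_success_rate(outcomes: list[bool]) -> float:
    """ASR = successful attacks / total attacks (0.0 for an empty batch)."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


@dataclass
class ToyDefender:
    """Hypothetical stand-in for the Defender (no real model involved)."""
    generalize_after: int = 2  # assumed: refuse a whole family after this many examples
    seen: set[str] = field(default_factory=set)
    family_counts: dict[str, int] = field(default_factory=dict)

    def refuses(self, strategy: str) -> bool:
        family = strategy.split("#")[0]
        generalized = self.family_counts.get(family, 0) >= self.generalize_after
        return strategy in self.seen or generalized

    def train(self, successful: list[str]) -> None:
        # "Training" here just records the attacks that got through.
        for strategy in successful:
            if strategy not in self.seen:
                self.seen.add(strategy)
                family = strategy.split("#")[0]
                self.family_counts[family] = self.family_counts.get(family, 0) + 1


@dataclass
class ToyMetaAttacker:
    """Hypothetical stand-in for the Meta-Attacker, warmed up with seed families."""
    seeds: list[str]
    wins: list[str] = field(default_factory=list)
    _counter: int = 0

    def propose(self, n: int) -> list[str]:
        # Reuse some past successes, then fill the batch with new variants.
        reused = self.wins[-(n // 2):] if n >= 2 else []
        novel: list[str] = []
        while len(reused) + len(novel) < n:
            base = self.seeds[self._counter % len(self.seeds)]
            novel.append(f"{base}#{self._counter}")
            self._counter += 1
        return reused + novel

    def learn(self, successful: list[str]) -> None:
        self.wins.extend(successful)


def run_lifelong_loop(attacker: ToyMetaAttacker, defender: ToyDefender,
                      iterations: int, batch: int) -> list[float]:
    """Alternate attack and defense; return the ASR measured in each iteration."""
    history: list[float] = []
    for _ in range(iterations):
        strategies = attacker.propose(batch)
        outcomes = [not defender.refuses(s) for s in strategies]
        history.append(attack_success_rate(outcomes))
        successful = [s for s, ok in zip(strategies, outcomes) if ok]
        attacker.learn(successful)
        defender.train(successful)
    return history


if __name__ == "__main__":
    attacker = ToyMetaAttacker(seeds=["roleplay", "cipher", "persuasion"])
    print(run_lifelong_loop(attacker, ToyDefender(), iterations=3, batch=4))
    # [1.0, 0.5, 0.0]
```

In this toy run the ASR falls across iterations because the defender refuses both the exact attacks it was trained on and, past the assumed threshold, every variant of a family it has seen. The numbers come from the toy rules alone and say nothing about the results reported in the abstract.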
