Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks

Abstract

Large language models (LLMs) are known to be vulnerable to jailbreak attacks, even after safety alignment. An important observation is that, while different types of jailbreak attacks can generate significantly different queries, they mostly elicit similar responses rooted in the same harmful knowledge (e.g., detailed steps to make a bomb). We therefore conjecture that directly unlearning the harmful knowledge in the LLM can be a more effective defense against jailbreak attacks than the mainstream supervised fine-tuning (SFT) based approaches. Our extensive experiments confirm this conjecture and show that our unlearning-based approach generalizes surprisingly well: using only 20 raw harmful questions, and no jailbreak prompts, during training, our solution reduces the Attack Success Rate (ASR) of Vicuna-7B on out-of-distribution (OOD) harmful questions wrapped with various complex jailbreak prompts from 82.6% to 7.7%. This significantly outperforms Llama2-7B-Chat, which is fine-tuned on about 0.1M safety alignment samples but still has an ASR of 21.9% even with the help of an additional safety system prompt. Further analysis reveals that the generalization ability of our solution stems from the intrinsic relatedness among harmful responses across harmful questions (e.g., response patterns, shared steps and actions, and similarity among their learned representations in the LLM). Our code is available at https://github.com/thu-coai/SafeUnlearning.
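
To make the headline metric concrete, the following is a minimal Python sketch of how an Attack Success Rate could be computed from per-prompt judgments, and how the relative drop between two ASR values works out. The helper names `attack_success_rate` and `relative_reduction`, the boolean judgment list, and the toy data are illustrative assumptions and are not taken from the authors' code; how a response is judged harmful is outside the scope of this sketch.

```python
from typing import Sequence


def attack_success_rate(judged_harmful: Sequence[bool]) -> float:
    """Return the percentage of attack prompts whose response was judged harmful.

    Hypothetical helper: each entry is True when the model's response to one
    jailbreak-wrapped question was judged harmful (the attack succeeded).
    """
    if not judged_harmful:
        raise ValueError("need at least one judged response")
    return 100.0 * sum(judged_harmful) / len(judged_harmful)


def relative_reduction(before: float, after: float) -> float:
    """Relative drop in ASR, expressed as a fraction of the starting value."""
    return (before - after) / before


# Toy judgments, made up for illustration: 2 of 8 attacks succeed.
toy = [True, False, False, True, False, False, False, False]
print(attack_success_rate(toy))  # 25.0

# ASR figures reported in the abstract for Vicuna-7B (82.6% before, 7.7% after).
print(round(relative_reduction(82.6, 7.7), 3))  # 0.907
```

Under these assumptions, the reported change from 82.6% to 7.7% corresponds to a relative reduction in ASR of roughly 91%.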
