

Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks

July 3, 2024
Authors: Zhexin Zhang, Junxiao Yang, Pei Ke, Shiyao Cui, Chujie Zheng, Hongning Wang, Minlie Huang
cs.AI

Abstract

LLMs are known to be vulnerable to jailbreak attacks, even after safety alignment. An important observation is that, while different types of jailbreak attacks can generate significantly different queries, they mostly result in similar responses that are rooted in the same harmful knowledge (e.g., detailed steps to make a bomb). Therefore, we conjecture that directly unlearning the harmful knowledge in the LLM can be a more effective way to defend against jailbreak attacks than the mainstream supervised fine-tuning (SFT) based approaches. Our extensive experiments confirmed our insight and suggested the surprising generalizability of our unlearning-based approach: using only 20 raw harmful questions without any jailbreak prompt during training, our solution reduced the Attack Success Rate (ASR) of Vicuna-7B on out-of-distribution (OOD) harmful questions wrapped with various complex jailbreak prompts from 82.6% to 7.7%. This significantly outperforms Llama2-7B-Chat, which is fine-tuned on about 0.1M safety alignment samples but still has an ASR of 21.9% even with the help of an additional safety system prompt. Further analysis reveals that the generalization ability of our solution stems from the intrinsic relatedness among harmful responses across harmful questions (e.g., response patterns, shared steps and actions, and similarity among their learned representations in the LLM). Our code is available at https://github.com/thu-coai/SafeUnlearning.
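The abstract describes the approach only at a high level. As a rough illustration of what an unlearning-style fine-tuning step can look like, the sketch below applies gradient ascent on harmful (question, response) pairs while keeping a standard language-modeling loss on benign data to preserve utility. This is a hypothetical, simplified example using PyTorch and Hugging Face Transformers: the model name, hyperparameters, and helper names (`lm_loss`, `unlearning_step`, `alpha`) are illustrative placeholders, and the loss shown is not the exact Safe Unlearning objective (which is defined in the paper and the linked repository, and also trains the model toward safe responses).

```python
# Minimal sketch of an unlearning-style training step (hypothetical; not the
# exact Safe Unlearning objective). Idea: push down the likelihood of harmful
# responses (gradient ascent on their LM loss) while keeping the likelihood of
# helpful responses on benign prompts up (standard LM loss).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "lmsys/vicuna-7b-v1.5"  # Vicuna-7B is one model evaluated in the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # placeholder hyperparameters


def lm_loss(prompt: str, response: str) -> torch.Tensor:
    """Cross-entropy of the response tokens given the prompt (prompt tokens masked out)."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignore prompt positions in the loss
    return model(input_ids=full_ids, labels=labels).loss


def unlearning_step(harmful_q, harmful_a, benign_q, benign_a, alpha=1.0):
    """One optimization step: ascend on a harmful pair, descend on a benign pair."""
    loss = -lm_loss(harmful_q, harmful_a) + alpha * lm_loss(benign_q, benign_a)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice, the unlearning term is usually bounded or paired with training toward explicit safe refusals, so that the gradient-ascent component does not destabilize the model's general capabilities.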
