

An Embarrassingly Simple Defense Against LLM Abliteration Attacks

May 25, 2025
Authors: Harethah Abu Shairah, Hasan Abed Al Kader Hammoud, Bernard Ghanem, George Turkiyyah
cs.AI

Abstract

Large language models (LLMs) are typically aligned to comply with safety guidelines by refusing harmful instructions. A recent attack, termed abliteration, isolates and suppresses the single latent direction most responsible for refusal behavior, enabling the model to generate unethical content. We propose a defense that modifies how models generate refusals. We construct an extended-refusal dataset that contains harmful prompts with a full response that justifies the reason for refusal. We then fine-tune Llama-2-7B-Chat and Qwen2.5-Instruct (1.5B and 3B parameters) on our extended-refusal dataset, and evaluate the resulting systems on a set of harmful prompts. In our experiments, extended-refusal models maintain high refusal rates, dropping at most by 10%, whereas baseline models' refusal rates drop by 70-80% after abliteration. A broad evaluation of safety and utility shows that extended-refusal fine-tuning neutralizes the abliteration attack while preserving general performance.
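To make the attack described in the abstract concrete, here is a minimal sketch of the directional-ablation idea behind abliteration: estimate a single "refusal direction" in activation space and project it out of the hidden states. The difference-of-means estimator, the hidden dimension, and the `ablate_direction` helper below are illustrative assumptions, not the paper's exact procedure.

```python
import torch

hidden_dim = 4096  # assumed residual-stream width, for illustration only

# Hypothetical mean activations collected at one layer for harmful prompts
# (which normally trigger refusals) and for harmless prompts.
mean_harmful = torch.randn(hidden_dim)
mean_harmless = torch.randn(hidden_dim)

# Candidate refusal direction: difference of means, normalized to unit length.
refusal_dir = mean_harmful - mean_harmless
refusal_dir = refusal_dir / refusal_dir.norm()

def ablate_direction(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of `hidden` along `direction` (orthogonal projection)."""
    coeff = hidden @ direction                    # projection coefficient per vector
    return hidden - coeff.unsqueeze(-1) * direction

# Example: after ablation, hidden states carry no component along the
# refusal direction, which is what suppresses refusal behavior.
hidden_states = torch.randn(8, hidden_dim)
ablated = ablate_direction(hidden_states, refusal_dir)
print((ablated @ refusal_dir).abs().max())        # ~0
```

The proposed defense changes what this single direction can capture: by fine-tuning on extended refusals that justify the reason for refusing, refusal behavior is no longer concentrated in one latent direction, so projecting one direction out no longer removes it.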
