

An Embarrassingly Simple Defense Against LLM Abliteration Attacks

May 25, 2025
Authors: Harethah Abu Shairah, Hasan Abed Al Kader Hammoud, Bernard Ghanem, George Turkiyyah
cs.AI

Abstract

Large language models (LLMs) are typically aligned to comply with safety guidelines by refusing harmful instructions. A recent attack, termed abliteration, isolates and suppresses the single latent direction most responsible for refusal behavior, enabling the model to generate unethical content. We propose a defense that modifies how models generate refusals. We construct an extended-refusal dataset that contains harmful prompts with a full response that justifies the reason for refusal. We then fine-tune Llama-2-7B-Chat and Qwen2.5-Instruct (1.5B and 3B parameters) on our extended-refusal dataset, and evaluate the resulting systems on a set of harmful prompts. In our experiments, extended-refusal models maintain high refusal rates, dropping at most by 10%, whereas baseline models' refusal rates drop by 70-80% after abliteration. A broad evaluation of safety and utility shows that extended-refusal fine-tuning neutralizes the abliteration attack while preserving general performance.
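To make the attack described in the abstract concrete, here is a minimal sketch of the directional-ablation idea behind abliteration: estimate a single "refusal direction" in activation space and project it out of the hidden states. The difference-of-means estimator, the hidden dimension, and the `ablate_direction` helper below are illustrative assumptions, not the paper's exact procedure.

```python
import torch

hidden_dim = 4096  # assumed residual-stream width, for illustration only

# Hypothetical mean activations collected at one layer for harmful prompts
# (which normally trigger refusals) and for harmless prompts.
mean_harmful = torch.randn(hidden_dim)
mean_harmless = torch.randn(hidden_dim)

# Candidate refusal direction: difference of means, normalized to unit length.
refusal_dir = mean_harmful - mean_harmless
refusal_dir = refusal_dir / refusal_dir.norm()

def ablate_direction(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of `hidden` along `direction` (orthogonal projection)."""
    coeff = hidden @ direction                    # projection coefficient per vector
    return hidden - coeff.unsqueeze(-1) * direction

# Example: after ablation, hidden states carry no component along the
# refusal direction, which is what suppresses refusal behavior.
hidden_states = torch.randn(8, hidden_dim)
ablated = ablate_direction(hidden_states, refusal_dir)
print((ablated @ refusal_dir).abs().max())        # ~0
```

The proposed defense changes what this single direction can capture: by fine-tuning on extended refusals that justify the reason for refusing, refusal behavior is no longer concentrated in one latent direction, so projecting one direction out no longer removes it.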
