An Embarrassingly Simple Defense Against LLM Abliteration Attacks
May 25, 2025
Authors: Harethah Abu Shairah, Hasan Abed Al Kader Hammoud, Bernard Ghanem, George Turkiyyah
cs.AI
Abstract
Large language models (LLMs) are typically aligned to comply with safety
guidelines by refusing harmful instructions. A recent attack, termed
abliteration, isolates and suppresses the single latent direction most
responsible for refusal behavior, enabling the model to generate unethical
content. We propose a defense that modifies how models generate refusals. We
construct an extended-refusal dataset that contains harmful prompts with a full
response that justifies the reason for refusal. We then fine-tune
Llama-2-7B-Chat and Qwen2.5-Instruct (1.5B and 3B parameters) on our
extended-refusal dataset, and evaluate the resulting systems on a set of
harmful prompts. In our experiments, extended-refusal models maintain high
refusal rates, dropping by at most 10%, whereas baseline models' refusal rates
drop by 70-80% after abliteration. A broad evaluation of safety and utility
shows that extended-refusal fine-tuning neutralizes the abliteration attack
while preserving general performance.
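For intuition about the attack the abstract describes, the sketch below illustrates the core of abliteration: estimating a single "refusal direction" from model activations and projecting it out. This is a minimal, self-contained approximation in which random tensors stand in for real residual-stream activations; the function names, shapes, and toy data are assumptions for illustration, not the paper's code.

```python
# Minimal sketch of directional ablation ("abliteration"), assuming residual-stream
# activations for harmful and harmless prompts have already been collected.
import torch


def refusal_direction(harmful_acts: torch.Tensor,
                      harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction between activations on harmful and harmless
    prompts. Both inputs have shape (num_prompts, hidden_dim)."""
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()


def ablate_direction(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of `hidden` along the unit vector `direction`.
    hidden: (..., hidden_dim); direction: (hidden_dim,)."""
    proj = (hidden @ direction).unsqueeze(-1) * direction
    return hidden - proj


if __name__ == "__main__":
    # Toy usage: random activations standing in for a model's residual stream.
    torch.manual_seed(0)
    hidden_dim = 16
    harmful = torch.randn(32, hidden_dim) + 2.0   # pretend "refusal-laden" activations
    harmless = torch.randn(32, hidden_dim)
    d = refusal_direction(harmful, harmless)
    edited = ablate_direction(harmful, d)
    # After ablation, activations have (near-)zero component along d.
    print(edited @ d)
```

The defense proposed in the paper does not touch this projection step; it instead fine-tunes the model on extended refusals (full responses that justify the refusal), presumably making refusal behavior less dependent on any single latent direction that such a projection could remove.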