An Embarrassingly Simple Defense Against LLM Abliteration Attacks
May 25, 2025
Authors: Harethah Abu Shairah, Hasan Abed Al Kader Hammoud, Bernard Ghanem, George Turkiyyah
cs.AI
Abstract
Large language models (LLMs) are typically aligned to comply with safety
guidelines by refusing harmful instructions. A recent attack, termed
abliteration, isolates and suppresses the single latent direction most
responsible for refusal behavior, enabling the model to generate unethical
content. We propose a defense that modifies how models generate refusals. We
construct an extended-refusal dataset that contains harmful prompts with a full
response that justifies the reason for refusal. We then fine-tune
Llama-2-7B-Chat and Qwen2.5-Instruct (1.5B and 3B parameters) on our
extended-refusal dataset, and evaluate the resulting systems on a set of
harmful prompts. In our experiments, extended-refusal models maintain high
refusal rates, dropping by at most 10%, whereas baseline models' refusal rates
drop by 70-80% after abliteration. A broad evaluation of safety and utility
shows that extended-refusal fine-tuning neutralizes the abliteration attack
while preserving general performance.
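For intuition about the attack the abstract describes, the sketch below illustrates the core of abliteration: estimating a single "refusal direction" from model activations and projecting it out. This is a minimal, self-contained approximation in which random tensors stand in for real residual-stream activations; the function names, shapes, and toy data are assumptions for illustration, not the paper's code.

```python
# Minimal sketch of directional ablation ("abliteration"), assuming residual-stream
# activations for harmful and harmless prompts have already been collected.
import torch


def refusal_direction(harmful_acts: torch.Tensor,
                      harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction between activations on harmful and harmless
    prompts. Both inputs have shape (num_prompts, hidden_dim)."""
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()


def ablate_direction(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of `hidden` along the unit vector `direction`.
    hidden: (..., hidden_dim); direction: (hidden_dim,)."""
    proj = (hidden @ direction).unsqueeze(-1) * direction
    return hidden - proj


if __name__ == "__main__":
    # Toy usage: random activations standing in for a model's residual stream.
    torch.manual_seed(0)
    hidden_dim = 16
    harmful = torch.randn(32, hidden_dim) + 2.0   # pretend "refusal-laden" activations
    harmless = torch.randn(32, hidden_dim)
    d = refusal_direction(harmful, harmless)
    edited = ablate_direction(harmful, d)
    # After ablation, activations have (near-)zero component along d.
    print(edited @ d)
```

The defense proposed in the paper does not touch this projection step; it instead fine-tunes the model on extended refusals (full responses that justify the refusal), presumably making refusal behavior less dependent on any single latent direction that such a projection could remove.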