An Embarrassingly Simple Defense Against LLM Abliteration Attacks

Abstract

Large language models (LLMs) are typically aligned to comply with safety guidelines by refusing harmful instructions. A recent attack, termed abliteration, isolates and suppresses the single latent direction most responsible for refusal behavior, enabling the model to generate unethical content. We propose a defense that modifies how models generate refusals. We construct an extended-refusal dataset that pairs harmful prompts with full responses justifying the refusal. We then fine-tune Llama-2-7B-Chat and Qwen2.5-Instruct (1.5B and 3B parameters) on this dataset and evaluate the resulting systems on a set of harmful prompts. In our experiments, extended-refusal models maintain high refusal rates, dropping by at most 10%, whereas baseline models' refusal rates drop by 70-80% after abliteration. A broad evaluation of safety and utility shows that extended-refusal fine-tuning neutralizes the abliteration attack while preserving general performance.
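
To make "suppressing a single latent direction" concrete, the following is a minimal sketch on synthetic data, not the attack or defense as implemented in the paper. It assumes the direction is estimated as the difference between mean activations on harmful and harmless prompts, an estimator the abstract does not specify, and it uses random arrays in place of real model activations. The names refusal_direction and ablate are hypothetical.

```python
import numpy as np


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of mean activations.

    Assumption: difference-of-means is used here only for illustration;
    the abstract does not say how the direction is obtained.
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def ablate(acts, direction):
    """Subtract from each activation row its projection onto `direction`."""
    return acts - np.outer(acts @ direction, direction)


# Synthetic stand-in for hidden states: "harmful" rows are shifted along axis 0.
rng = np.random.default_rng(0)
dim = 16
true_dir = np.zeros(dim)
true_dir[0] = 1.0
harmless = rng.normal(size=(200, dim))
harmful = rng.normal(size=(200, dim)) + 4.0 * true_dir

r_hat = refusal_direction(harmful, harmless)
ablated = ablate(harmful, r_hat)

# After ablation, no row has a component left along the estimated direction.
residual = np.abs(ablated @ r_hat).max()
print("largest remaining component along r_hat:", residual)
```

The sketch covers only the projection step. It says nothing about the extended-refusal fine-tuning itself or about why that fine-tuning limits the drop in refusal rate.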
