Be Kind, Rewrite: Benign Projection via Rewriting Defends Against LLM Data Poisoning Attacks

Abstract

Large language models (LLMs) are highly susceptible to backdoor attacks (BAs), in which training samples are poisoned with trigger-based harmful content. Moreover, existing defenses prove ineffective when tested across a broad range of BA patterns. To better combat BAs, we explore the use of LLM rewriting as a proactive defense against data poisoning. First, we show theoretically that when LLM rewriting draws on open-book benign samples, an approach we term open-book benign rewriting (OBBR), the probability that a rewritten output is benign is strictly greater than under closed-book rewriting. OBBR thus neutralizes harmful content by projecting training samples onto the space of benign prompts. We then show that, in contrast to previous defenses, OBBR effectively mitigates a large number of existing BAs: across five known BAs and four widely used LLMs, OBBR increases safety performance by an average of 51% compared to state-of-the-art BA defenses and by 25.7% compared to closed-book rewriting methods. Finally, we show that OBBR is computationally efficient relative to other BA defenses, does not degrade model performance on natural language tasks after fine-tuning, and can also defend against non-trigger-based data poisoning attacks.
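
To make the open-book idea concrete, the following is a minimal sketch of how a training set might be passed through a rewriter that sees benign reference samples alongside each input. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify the prompt wording, how benign samples are chosen, or which model performs the rewriting. The names `build_open_book_prompt`, `rewrite_dataset`, the parameter `k`, and the `rewriter` callable are all hypothetical, and random selection of references is a placeholder choice.

```python
import random
from typing import Callable, List, Sequence


def build_open_book_prompt(sample: str, benign_examples: Sequence[str]) -> str:
    """Build a rewriting prompt that shows benign references next to the sample.

    The prompt wording is an assumption made for illustration only.
    """
    references = "\n".join(f"- {ex}" for ex in benign_examples)
    return (
        "Rewrite the training sample below so that it is benign, "
        "using the reference samples as a guide.\n"
        f"Benign reference samples:\n{references}\n"
        f"Training sample:\n{sample}\n"
        "Rewritten sample:"
    )


def rewrite_dataset(
    samples: Sequence[str],
    benign_pool: Sequence[str],
    rewriter: Callable[[str], str],
    k: int = 2,
    seed: int = 0,
) -> List[str]:
    """Rewrite every sample with k benign references drawn from benign_pool.

    `rewriter` stands in for a call to an LLM (hypothetical); it receives the
    full prompt and returns the rewritten text. References are drawn uniformly
    at random, which is a placeholder for whatever selection a real system uses.
    """
    rng = random.Random(seed)
    pool = list(benign_pool)
    rewritten = []
    for sample in samples:
        refs = rng.sample(pool, k=min(k, len(pool)))
        rewritten.append(rewriter(build_open_book_prompt(sample, refs)))
    return rewritten


if __name__ == "__main__":
    # Stub rewriter so the sketch runs without any model; it only reports
    # the prompt length instead of producing a real rewrite.
    def stub_rewriter(prompt: str) -> str:
        return f"<rewrite of a {len(prompt)}-character prompt>"

    pool = ["How do I bake bread?", "Summarize this article in two sentences."]
    for line in rewrite_dataset(["Example training sample."], pool, stub_rewriter):
        print(line)
```

A closed-book variant would correspond to calling the rewriter with no reference samples in the prompt. Every sample is rewritten, rather than only those flagged as suspicious, which reflects the proactive framing in the abstract: the defense does not need to detect the trigger first.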