Gentle Rewriting: Defending Against LLM Data Poisoning Attacks with Benign Projection via Rewriting

Abstract

Large language models (LLMs) are highly susceptible to backdoor attacks (BAs), in which training samples are poisoned with trigger-based harmful content. Existing defenses, moreover, have proven ineffective when tested extensively across BA patterns. To better combat BAs, we explore LLM rewriting as a proactive defense against data poisoning. First, we show theoretically that when LLM rewriting draws on open-book benign samples, an approach we term open-book benign rewriting (OBBR), the probability that a rewritten output is benign is strictly greater than under closed-book rewriting. OBBR thus neutralizes harmful content by projecting training samples onto the space of benign prompts. We then show that, in contrast to previous defenses, OBBR effectively mitigates a large number of existing BAs: across five known BAs and four widely used LLMs, OBBR improves safety performance by an average of 51% compared to state-of-the-art BA defenses and by an average of 25.7% compared to closed-book rewriting methods. Finally, we show that OBBR is computationally efficient relative to other BA defenses, does not degrade model performance on natural language tasks after fine-tuning, and can defend against non-trigger-based data poisoning attacks.
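The contrast between open-book and closed-book rewriting can be illustrated with a minimal sketch. The abstract does not specify the prompt template, how benign samples are selected, or which model performs the rewriting, so everything below is an assumption made for illustration: the function names `build_open_book_prompt`, `build_closed_book_prompt`, and `rewrite_dataset` are hypothetical, and `rewrite_fn` stands in for whatever LLM call is actually used.

```python
from typing import Callable, List, Sequence


def build_closed_book_prompt(sample: str) -> str:
    """Closed-book variant: the rewriter sees only the training sample."""
    return (
        "Rewrite the training sample below.\n"
        f"Training sample:\n{sample}\n"
        "Rewritten sample:"
    )


def build_open_book_prompt(sample: str, benign_examples: Sequence[str]) -> str:
    """Open-book variant: the rewriter also sees benign reference samples.

    The wording of this template is an assumption, not the paper's prompt.
    """
    references = "\n".join(f"- {example}" for example in benign_examples)
    return (
        "Rewrite the training sample below so that it reads like the "
        "benign reference samples.\n"
        f"Benign reference samples:\n{references}\n"
        f"Training sample:\n{sample}\n"
        "Rewritten sample:"
    )


def rewrite_dataset(
    samples: Sequence[str],
    benign_examples: Sequence[str],
    rewrite_fn: Callable[[str], str],
) -> List[str]:
    """Pass every training sample through the rewriter before fine-tuning.

    rewrite_fn is a placeholder for an LLM call that maps a prompt to text.
    """
    return [
        rewrite_fn(build_open_book_prompt(sample, benign_examples))
        for sample in samples
    ]
```

In this sketch the only difference between the two variants is whether benign reference samples appear in the prompt, which mirrors the distinction the abstract draws; the rewritten samples, rather than the originals, would then be used for fine-tuning.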