Be Kind, Rewrite: Benign Projections via Rewriting Defend Against LLM Data Poisoning Attacks
May 18, 2026
Authors: John T. Halloran, Noopur S. Bhatt
cs.AI
Abstract
Large language models (LLMs) are highly susceptible to backdoor attacks (BAs), wherein training samples are poisoned with trigger-based harmful content. Furthermore, existing defenses have proven ineffective when extensively tested across BA patterns. To better combat BAs, we explore the use of LLM rewriting as a proactive defense against data poisoning. First, we theoretically show that when LLM rewriting utilizes open-book benign samples, an approach termed open-book benign rewriting (OBBR), the probability of a rewritten output being benign is strictly greater than that of closed-book rewriting. Thus, OBBR neutralizes harmful content by projecting training samples to the space of benign prompts. We then show that, in contrast to previous defenses, OBBR effectively mitigates a large number of existing BAs: across five known BAs and four widely used LLMs, OBBR increases safety performance by an average of 51% compared to state-of-the-art BA defenses and by 25.7% compared to closed-book rewriting methods. Finally, we show that OBBR is computationally efficient relative to other BA defenses, does not degrade model performance on natural language tasks after fine-tuning, and is capable of defending against non-trigger-based data poisoning attacks.
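To make the open-book idea concrete, the following is a minimal sketch of how a rewriting pass over a training set might be wired up. It is not the authors' implementation: the abstract does not give the prompt or the model interface, so the template wording, the function names `build_open_book_prompt` and `rewrite_dataset`, and the `rewrite_fn` callable (standing in for whatever LLM performs the rewriting) are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence


def build_open_book_prompt(sample: str, benign_examples: Sequence[str]) -> str:
    """Assemble a rewriting prompt that lists benign reference samples first.

    The template wording is hypothetical; the paper's actual prompt is not
    given in the abstract.
    """
    references = "\n".join(f"- {example}" for example in benign_examples)
    return (
        "Here are examples of benign prompts:\n"
        f"{references}\n\n"
        "Rewrite the following training sample so that it matches the style "
        "and intent of the benign examples above:\n"
        f"{sample}"
    )


def rewrite_dataset(
    samples: Sequence[str],
    benign_examples: Sequence[str],
    rewrite_fn: Callable[[str], str],
) -> List[str]:
    """Rewrite every training sample before fine-tuning.

    rewrite_fn is an assumed stand-in for a call to the rewriting LLM: it
    takes the assembled prompt and returns the rewritten sample.
    """
    return [
        rewrite_fn(build_open_book_prompt(sample, benign_examples))
        for sample in samples
    ]
```

In this sketch, a closed-book variant would pass an empty `benign_examples` list, so the rewriting model would see only the possibly poisoned sample. The open-book variant differs only in the benign references placed in the prompt.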