Be Kind, Rewrite: Benign Projections via Rewriting Defend Against LLM Data Poisoning Attacks

May 18, 2026
Authors: John T. Halloran, Noopur S. Bhatt
cs.AI

Abstract

Large language models (LLMs) are highly susceptible to backdoor attacks (BAs), wherein training samples are poisoned using trigger-based harmful content. Furthermore, existing defenses have proven ineffective when tested extensively across BA patterns. To better combat BAs, we explore the use of LLM rewriting as a proactive defense against data poisoning. First, we show theoretically that when LLM rewriting uses open-book benign samples, an approach we term open-book benign rewriting (OBBR), the probability of a rewritten output being benign is strictly greater than that of closed-book rewriting. Thus, OBBR neutralizes harmful content by projecting training samples onto the space of benign prompts. We then show that, in contrast to previous defenses, OBBR effectively mitigates a large number of existing BAs: across five known BAs and four widely used LLMs, OBBR increases safety performance by an average of 51% compared to state-of-the-art BA defenses and by 25.7% compared to closed-book rewriting methods. Finally, we show that OBBR is computationally efficient relative to other BA defenses, does not degrade model performance on natural language tasks after fine-tuning, and is capable of defending against non-trigger-based data poisoning attacks.
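
To make the open-book versus closed-book distinction concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not give a prompt template, so the prompt wording, the function names (`closed_book_prompt`, `open_book_prompt`, `rewrite_dataset`), and the `rewriter` callable standing in for an LLM call are all hypothetical. The only idea taken from the abstract is that the open-book variant shows the rewriting model benign samples alongside the training sample to be rewritten.

```python
from typing import Callable, List, Sequence


def closed_book_prompt(sample: str) -> str:
    """Hypothetical closed-book prompt: the rewriter sees only the sample."""
    return (
        "Rewrite the following training sample.\n\n"
        f"Sample:\n{sample}\n\nRewrite:"
    )


def open_book_prompt(sample: str, benign_examples: Sequence[str]) -> str:
    """Hypothetical open-book prompt: benign samples are shown as references."""
    references = "\n".join(f"- {example}" for example in benign_examples)
    return (
        "Here are examples of benign samples:\n"
        f"{references}\n\n"
        "Rewrite the following training sample so that it resembles "
        "the benign examples.\n\n"
        f"Sample:\n{sample}\n\nRewrite:"
    )


def rewrite_dataset(
    samples: Sequence[str],
    benign_examples: Sequence[str],
    rewriter: Callable[[str], str],
) -> List[str]:
    """Rewrite every training sample before fine-tuning.

    `rewriter` is an assumed stand-in for any function that sends a prompt
    to an LLM and returns its text output.
    """
    return [rewriter(open_book_prompt(s, benign_examples)) for s in samples]
```

In use, `rewriter` would wrap whichever LLM client is available, and the rewritten samples would replace the originals in the fine-tuning set. How the benign examples are chosen, and how many are supplied, is not stated in the abstract and is left open here.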