友善重写：通过重写生成良性投影以防御大语言模型数据投毒攻击

摘要

大语言模型（LLMs）极易遭受后门攻击（BAs），即通过基于触发器的有害内容对训练样本进行投毒。此外，现有防御措施在经过BA模式的广泛测试后被证明效果有限。为了更好地对抗BAs，我们探索将LLM改写作为对抗数据投毒的主动防御手段。首先，我们从理论上证明：当LLM改写采用开卷良性样本——即开卷良性改写（OBBR）时，改写输出为良性的概率严格高于闭卷改写。因此，OBBR通过将训练样本投影至良性提示空间来中和有害内容。我们进一步表明，与以往防御方法不同，OBBR能有效缓解大量现有BAs：在五种已知BAs和四个广泛使用的LLMs上，OBBR相较于最先进的BA防御方法，安全性能平均提升51%；相较于闭卷改写方法则提升25.7%。最后，我们证明OBBR相较于其他BA防御方法计算效率更高，微调后不会降低模型在自然语言任务上的性能，并且能够抵御非基于触发器的数据投毒攻击。

English

Large language models (LLMs) are highly susceptible to backdoor attacks (BAs), wherein training samples are poisoned using trigger-based harmful content. Furthermore, existing defenses have proven ineffective when extensively tested across BA patterns. To better combat BAs, we explore the use of LLM rewriting as a proactive defense against data poisoning. First, we theoretically show that when LLM rewriting utilizes open-book benign samples--termed open-book benign rewriting (OBBR)--the probability of a rewritten output being benign is strictly greater than that of closed-book rewriting. Thus, OBBR neutralizes harmful content by projecting training samples to the space of benign prompts. We then show that, in contrast to previous defenses, OBBR effectively mitigates a large number of existing BAs: across five known BAs and four widely used LLMs, OBBR increases safety performance by an average 51% compared to state-of-the-art BA defenses and 25.7% compared to closed-book rewriting methods. Finally, we show that OBBR is computationally efficient relative to other BA defenses, does not degrade model performance on natural language tasks after fine-tuning, and is capable of defending against non-trigger based data poisoning attacks.