GradSentry: Gradient Spectral Entropy for Backdoor Sample Filtering in Large Language Model Fine-Tuning
May 26, 2026
Authors: Haodong Zhao, Tianyi Xu, Tianhang Zhao, Zhuosheng Zhang, Gongshen Liu
cs.AI
Abstract
Fine-tuning large language models (LLMs) on untrusted data exposes them to backdoor attacks, in which poisoned samples cause targeted misbehavior. Existing sample-filtering defenses rely on clustering, which requires sufficient data and can fail at extreme poison ratios. We propose GradSentry (Gradient Sentry), a backdoor sample filtering method based on the spectral entropy of per-sample gradients. Our key finding is that poisoned samples produce gradients with higher spectral entropy than clean samples. GradSentry captures output-altering backdoor signatures using per-sample gradient spectra, avoiding pairwise sample comparisons and clustering during feature construction. Importantly, our method is training-agnostic: because the gradient analysis is independent of which parameters are updated during training, it works for both parameter-efficient fine-tuning methods such as LoRA and full-parameter tuning. GradSentry requires no clustering, operates effectively across all poison ratios (1% to 90%), and introduces minimal computational overhead (20 to 50 ms per sample for a 7B model). Evaluation on four QA datasets and four attack types demonstrates the effectiveness of spectral entropy for backdoor detection. Code is available at https://github.com/dongdongzhaoUP/GradSentry.
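To make the idea of a per-sample spectral-entropy score concrete, here is a minimal sketch. It is not the authors' implementation: the abstract does not define the spectrum, so this sketch assumes the spectrum is the set of singular values of a 2-D gradient matrix, normalized to a probability distribution, with Shannon entropy as the score. The names `spectral_entropy` and `flag_suspicious` and the threshold value are hypothetical, and the random matrices merely stand in for real gradients.

```python
import numpy as np


def spectral_entropy(grad: np.ndarray, eps: float = 1e-12) -> float:
    """Shannon entropy of the normalized singular-value spectrum of a 2-D array.

    Assumption: the "spectrum" is taken to be the singular values of the
    gradient matrix; the paper may define it differently.
    """
    s = np.linalg.svd(grad, compute_uv=False)  # singular values, descending
    p = s / (s.sum() + eps)                    # normalize to a distribution
    p = p[p > eps]                             # drop numerically zero entries
    return float(-(p * np.log(p)).sum())


def flag_suspicious(entropies: np.ndarray, threshold: float) -> np.ndarray:
    """Mark samples whose score exceeds a (hypothetical) threshold.

    Each sample is scored on its own, so no clustering or pairwise
    comparison between samples is involved.
    """
    return np.asarray(entropies) > threshold


rng = np.random.default_rng(0)

# Stand-in for a concentrated spectrum: a rank-1 matrix.
low = np.outer(rng.standard_normal(64), rng.standard_normal(32))
# Stand-in for a spread-out spectrum: a dense Gaussian matrix.
high = rng.standard_normal((64, 32))

h_low = spectral_entropy(low)
h_high = spectral_entropy(high)
flags = flag_suspicious(np.array([h_low, h_high]), threshold=1.0)
```

Under these assumptions the rank-1 matrix scores close to zero and the dense matrix scores close to the maximum of log(32), so only the second would be flagged. In practice the input would be the gradient of the loss for a single training sample with respect to a chosen weight matrix, and the threshold would have to be set from data.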