GradSentry: Gradient Spectral Entropy for Backdoor Sample Filtering in Large Language Model Fine-Tuning

May 26, 2026
Authors: Haodong Zhao, Tianyi Xu, Tianhang Zhao, Zhuosheng Zhang, Gongshen Liu
cs.AI

Abstract

Fine-tuning large language models (LLMs) on untrusted data exposes them to backdoor attacks, in which poisoned samples cause targeted misbehavior. Existing sample-filtering defenses rely on clustering, which requires sufficient data and can fail at extreme poison ratios. We propose GradSentry (Gradient Sentry), a backdoor sample filtering method based on the spectral entropy of per-sample gradients. Our key finding is that poisoned samples produce gradients with higher spectral entropy than clean samples. GradSentry captures output-altering backdoor signatures using per-sample gradient spectra, avoiding pairwise sample comparisons and clustering during feature construction. Importantly, our method is training-agnostic: it works for both parameter-efficient fine-tuning methods such as LoRA and full-parameter tuning, because the gradient analysis operates independently of which parameters are updated during training. GradSentry requires no clustering, operates effectively across poison ratios from 1% to 90%, and introduces minimal computational overhead (20-50 ms per sample for a 7B model). Evaluation on four QA datasets and four attack types demonstrates the effectiveness of spectral entropy for backdoor detection. Code is available at https://github.com/dongdongzhaoUP/GradSentry.
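To make the notion of gradient spectral entropy concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the "spectrum" of a per-sample gradient is the set of singular values of a 2-D gradient matrix, normalized to sum to one, and that the entropy is the Shannon entropy of that distribution; the abstract does not state the exact definition. The function names `spectral_entropy` and `flag_suspicious`, the fixed threshold, and the toy matrices are all hypothetical and stand in for real per-sample LLM gradients.

```python
import numpy as np


def spectral_entropy(grad: np.ndarray, eps: float = 1e-12) -> float:
    """Shannon entropy of the normalized singular-value spectrum of a 2-D matrix.

    Assumption: the spectrum is the singular values normalized to sum to 1.
    The paper may define the spectrum differently.
    """
    s = np.linalg.svd(grad, compute_uv=False)  # singular values only
    p = s / (s.sum() + eps)                    # normalize into a distribution
    p = p[p > eps]                             # drop zero mass so log is defined
    return float(-(p * np.log(p)).sum())


def flag_suspicious(entropies: np.ndarray, threshold: float) -> np.ndarray:
    """Boolean mask that is True where entropy exceeds a hypothetical threshold."""
    return np.asarray(entropies) > threshold


rng = np.random.default_rng(0)

# Toy stand-ins for per-sample gradients (not real LLM gradients).
u = rng.normal(size=(64, 1))
v = rng.normal(size=(1, 32))
low_rank = u @ v                      # energy concentrated in one direction
diffuse = rng.normal(size=(64, 32))   # energy spread over many directions

h_low = spectral_entropy(low_rank)
h_diffuse = spectral_entropy(diffuse)

# Higher entropy is treated as more suspicious, following the abstract's finding.
mask = flag_suspicious(np.array([h_low, h_diffuse]), threshold=1.0)
```

In this toy setup the rank-one matrix has near-zero entropy and the dense random matrix has entropy close to its upper bound of log(32), so only the second is flagged. Because the score is computed per sample, no pairwise comparison or clustering is needed, which matches the property the abstract highlights; how the threshold is actually chosen is not described in the abstract.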