GradSentry: Gradient Spectral Entropy for Backdoor Sample Filtering in Large Language Model Fine-Tuning

Abstract

Fine-tuning large language models (LLMs) on untrusted data exposes them to backdoor attacks, in which poisoned samples cause targeted misbehavior. Existing sample-filtering defenses rely on clustering, which requires sufficient data and can fail at extreme poison ratios. We propose GradSentry (Gradient Sentry), a backdoor sample filtering method based on the spectral entropy of per-sample gradients. Our key finding is that poisoned samples produce gradients with higher spectral entropy than clean samples. GradSentry captures output-altering backdoor signatures from per-sample gradient spectra, avoiding pairwise sample comparisons and clustering during feature construction. Importantly, our method is training-agnostic: because the gradient analysis operates independently of which parameters are updated during training, it works for both parameter-efficient fine-tuning methods such as LoRA and full-parameter tuning. GradSentry requires no clustering, operates effectively across poison ratios from 1% to 90%, and introduces minimal computational overhead (20 to 50 ms per sample for a 7B model). Evaluation on four QA datasets and four attack types demonstrates the effectiveness of spectral entropy for backdoor detection. Code is available at https://github.com/dongdongzhaoUP/GradSentry.
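
To make the central quantity concrete, the following is a minimal sketch of one common way to compute a spectral entropy for a single sample's gradient. It is an illustration under stated assumptions, not the implementation in the GradSentry repository: the abstract does not specify how the spectrum is formed, so we assume the gradient of one weight matrix is available as a 2-D array, that its spectrum is the set of squared singular values normalized to sum to one, and that entropy is Shannon entropy in nats. The names `spectral_entropy`, `flag_high_entropy`, and the fixed `threshold` are hypothetical.

```python
import numpy as np


def spectral_entropy(grad: np.ndarray, eps: float = 1e-12) -> float:
    """Shannon entropy (in nats) of the normalized singular-value energy of a 2-D gradient.

    Assumption: the "spectrum" is taken to be the squared singular values of the
    gradient matrix, normalized to sum to one. The paper may define it differently.
    """
    # Singular values only; the singular vectors are not needed.
    s = np.linalg.svd(grad, compute_uv=False)
    energy = s ** 2
    p = energy / (energy.sum() + eps)
    # Drop numerically zero components so that log() stays finite.
    p = p[p > eps]
    return float(-np.sum(p * np.log(p)))


def flag_high_entropy(entropies, threshold):
    """Return the indices of samples whose entropy exceeds a threshold.

    Assumption: a fixed threshold is used purely for illustration; how the
    decision boundary is chosen is not stated in the abstract.
    """
    return [i for i, h in enumerate(entropies) if h > threshold]
```

Under this definition, a rank-one gradient concentrates all of its energy in a single singular value and has entropy close to zero, while a gradient whose energy is spread evenly over k singular values has entropy log k. Because each score depends only on one sample's gradient, no pairwise comparison or clustering across samples is needed to compute it.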