GradSentry: Gradient Spectral Entropy for Backdoor Sample Filtering in Large Language Model Fine-Tuning

Abstract

Fine-tuning large language models on untrusted data exposes them to backdoor attacks, in which poisoned samples cause targeted misbehavior. Existing sample-filtering defenses rely on clustering, which requires sufficient data and can fail at extreme poison ratios. We propose GradSentry (Gradient Sentry), a backdoor sample filtering method based on the spectral entropy of per-sample gradients. Our key finding is that poisoned samples produce gradients with higher spectral entropy than clean samples do. GradSentry captures output-altering backdoor signatures from per-sample gradient spectra, so feature construction needs neither pairwise sample comparisons nor clustering. Importantly, the method is training-agnostic: it works with parameter-efficient fine-tuning methods such as LoRA as well as with full-parameter tuning, because the gradient analysis does not depend on which parameters are updated during training. GradSentry requires no clustering, remains effective across poison ratios from 1% to 90%, and adds little computational overhead (20 to 50 ms per sample for a 7B model). Evaluation on four QA datasets and four attack types demonstrates the effectiveness of spectral entropy for backdoor detection. Code is available at https://github.com/dongdongzhaoUP/GradSentry.
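The abstract does not say how the spectral entropy is computed, so the following is only an illustrative sketch under stated assumptions rather than the GradSentry implementation. It assumes that the per-sample gradient is taken with respect to a single 2-D weight matrix, that the "spectrum" is the set of singular values of that matrix, and that samples are filtered with a fixed threshold. The names `spectral_entropy`, `filter_samples`, and `threshold` are hypothetical.

```python
import numpy as np


def spectral_entropy(grad: np.ndarray, eps: float = 1e-12) -> float:
    """Normalized Shannon entropy of a 2-D gradient's singular-value spectrum.

    Assumption: "spectrum" means singular values; the paper may define it
    differently. Returns a value in [0, 1]: about 0 when the energy sits in
    one direction, 1 when it is spread evenly over all directions.
    """
    singular_values = np.linalg.svd(grad, compute_uv=False)
    energy = singular_values ** 2
    total = energy.sum()
    if total <= eps or energy.size < 2:
        return 0.0
    p = energy / total          # spectrum as a probability distribution
    p = p[p > eps]              # drop numerically zero components
    entropy = -(p * np.log(p)).sum()
    return float(entropy / np.log(energy.size))


def filter_samples(grads, threshold: float):
    """Return indices of samples kept as clean (score <= threshold).

    Assumption: a fixed threshold is a placeholder decision rule; poisoned
    samples are expected to score higher, per the abstract's key finding.
    """
    return [i for i, g in enumerate(grads) if spectral_entropy(g) <= threshold]


# Toy gradients: a rank-1 matrix (concentrated spectrum) and an identity
# matrix (flat spectrum). Real gradients would come from a backward pass.
low_entropy_grad = np.outer(np.arange(1.0, 5.0), np.arange(1.0, 5.0))
high_entropy_grad = np.eye(4)
kept = filter_samples([low_entropy_grad, high_entropy_grad], threshold=0.5)
```

In this toy setup the score is computed for each sample independently, which mirrors the abstract's point that no pairwise comparison or clustering is needed during feature construction. How the decision threshold is chosen in practice is not specified in the abstract.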