Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection

Abstract

We show that LoRA adapters, the dominant distribution format for fine-tuned LLMs, can be reliably backdoored through training data poisoning while preserving baseline task performance. On a Qwen 2.5 1.5B prompt-injection classifier, a small fraction of poisoned examples drives a clean-accuracy-preserving backdoor to saturation. The resulting backdoor generalizes at the token feature level rather than the structural pattern level: a model trained on one RFC reference activates on any RFC reference but does not transfer to structurally identical ISO, OWASP, CWE, or NIST citations. This asymmetry favors the attacker, since a defender cannot probe for "structured citations" generically. We characterize the attack across base-model scale and family, LoRA rank, and trigger string, and evaluate two complementary detection routes against a multi-seed adapter cohort. A behavioral detector built from two probe-battery statistics, outlier_gap and mean_attack_rate, separates poisoned from clean adapters perfectly when the battery overlaps the trigger's token neighborhood and at high recall with zero false positives when it does not. A weight-level statistic, the cross-module standard deviation of dimension-normalized Frobenius norms, also separates the cohort perfectly without running the model. Combined, the two routes are robust to probe composition. Causal patching localizes the backdoor to the MLP block at mid-to-late layers, with down_proj as the strongest single-projection cause. Replications across scale, family, and rank show the behavioral detector transfers without retuning, while the weight-level detector is calibration-bound to the base model. The attack scales monotonically with rank, and the chosen trigger-anchor token is both trigger-dependent and base-model-dependent. Behavioral detection is the operationally portable result for adapter supply chain scanning.
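As an illustration of the weight-level statistic named above, the following is a minimal sketch and not the paper's implementation. It assumes that each adapted module contributes a LoRA update ΔW = B·A, that "dimension-normalized" means dividing the squared Frobenius norm by the number of entries of ΔW before taking the square root, and that the spread is a population standard deviation. The function names and the adapter layout (a dict from module name to a `(B, A)` pair) are hypothetical.

```python
import math
import statistics


def matmul(b, a):
    """Plain-Python matrix product of b (m x r) and a (r x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*a)] for row in b]


def normalized_frobenius(matrix):
    """Frobenius norm divided by sqrt(rows * cols).

    Assumption: this is one plausible reading of "dimension-normalized";
    the abstract does not specify the exact normalization.
    """
    rows, cols = len(matrix), len(matrix[0])
    squared_sum = sum(v * v for row in matrix for v in row)
    return math.sqrt(squared_sum / (rows * cols))


def cross_module_std(adapter):
    """Standard deviation, across modules, of the normalized norm of B @ A.

    `adapter` maps a module name to a (B, A) pair of nested lists
    (hypothetical layout chosen for this sketch).
    """
    norms = [normalized_frobenius(matmul(b, a)) for b, a in adapter.values()]
    return statistics.pstdev(norms)


# Toy rank-1 adapters: one with evenly sized updates, one with a single
# module carrying a much larger update than the other.
even_adapter = {
    "q_proj": ([[1.0], [1.0]], [[1.0, 1.0]]),
    "down_proj": ([[1.0], [1.0], [1.0]], [[1.0, 1.0]]),
}
uneven_adapter = {
    "q_proj": ([[1.0], [1.0]], [[1.0, 1.0]]),
    "down_proj": ([[3.0], [3.0]], [[1.0, 1.0]]),
}

even_score = cross_module_std(even_adapter)
uneven_score = cross_module_std(uneven_adapter)
```

Under these assumptions, an adapter whose update mass is concentrated in a few modules scores higher than one whose updates are spread evenly. Any decision threshold would have to be calibrated per base model, consistent with the abstract's statement that the weight-level detector is calibration-bound.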