Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection

Abstract

We show that LoRA adapters, the dominant distribution format for fine-tuned LLMs, can be reliably backdoored through training-data poisoning while preserving baseline task performance. On a Qwen 2.5 1.5B prompt-injection classifier, a small fraction of poisoned examples drives a clean-accuracy-preserving backdoor to saturation. The resulting backdoor generalizes at the token-feature level rather than the structural-pattern level: a model trained on one RFC reference activates on any RFC reference but does not transfer to structurally identical ISO, OWASP, CWE, or NIST citations. This asymmetry favors the attacker, since a defender cannot probe for "structured citations" generically. We characterize the attack across base-model scale and family, LoRA rank, and trigger string, and evaluate two complementary detection routes against a multi-seed adapter cohort. A behavioral detector built from two probe-battery statistics, outlier_gap and mean_attack_rate, separates poisoned from clean adapters perfectly when the battery overlaps the trigger's token neighborhood; when it does not, the detector still reaches high recall with zero false positives. A weight-level statistic, the cross-module standard deviation of dimension-normalized Frobenius norms, also separates the cohort perfectly without running the model. Combined, the two routes are robust to probe composition. Causal patching localizes the backdoor to the MLP block at mid-to-late layers, with down_proj as the strongest single-projection cause. Replications across scale, family, and rank show that the behavioral detector transfers without retuning, while the weight-level detector must be calibrated to the base model. Attack strength increases monotonically with rank, and the chosen trigger-anchor token depends on both the trigger string and the base model. Behavioral detection is therefore the operationally portable result for adapter supply-chain scanning.
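
As an illustration of the weight-level route, the following is a minimal sketch of a cross-module standard deviation of dimension-normalized Frobenius norms. The abstract does not specify the normalization, so dividing the norm of the LoRA update B @ A by sqrt(d_out * d_in) is an assumption. The function names, module names, shapes, and synthetic weights are likewise hypothetical and chosen only so the example runs on its own.

```python
import numpy as np

def normalized_frobenius(A, B):
    """Frobenius norm of the LoRA update B @ A, divided by sqrt(d_out * d_in).

    The sqrt(d_out * d_in) normalization is an assumption, not the paper's
    confirmed definition.
    """
    delta = B @ A                      # shape (d_out, d_in)
    d_out, d_in = delta.shape
    return np.linalg.norm(delta, "fro") / np.sqrt(d_out * d_in)

def cross_module_std(adapter):
    """Standard deviation of the normalized norms across an adapter's modules.

    `adapter` maps a module name to its (A, B) LoRA factor pair.
    """
    norms = np.array([normalized_frobenius(A, B) for A, B in adapter.values()])
    return float(norms.std())

# Synthetic stand-in for an adapter: three modules with rank-4 factors.
rng = np.random.default_rng(0)
rank = 4
shapes = {"q_proj": (64, 64), "v_proj": (64, 64), "down_proj": (64, 256)}
clean = {
    name: (
        rng.normal(0.0, 0.02, size=(rank, d_in)),   # A: (rank, d_in)
        rng.normal(0.0, 0.02, size=(d_out, rank)),  # B: (d_out, rank)
    )
    for name, (d_out, d_in) in shapes.items()
}

# Hypothetical perturbation: inflate one module's update to mimic an adapter
# whose weight change is concentrated in a single projection.
inflated = dict(clean)
A_dp, B_dp = clean["down_proj"]
inflated["down_proj"] = (A_dp, 5.0 * B_dp)

print("clean   :", cross_module_std(clean))
print("inflated:", cross_module_std(inflated))
```

In this synthetic setup the inflated adapter yields a larger cross-module spread than the uniform one. Any decision threshold would need to be calibrated per base model, in line with the calibration dependence noted above.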