Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection

Abstract

We show that LoRA adapters, the dominant distribution format for fine-tuned LLMs, can be reliably backdoored through training data poisoning while preserving baseline task performance. On a Qwen 2.5 1.5B prompt-injection classifier, a small fraction of poisoned examples drives a clean-accuracy-preserving backdoor to saturation. The resulting backdoor generalizes at the token feature level rather than the structural pattern level: a model trained on one RFC reference activates on any RFC reference but does not transfer to structurally identical ISO, OWASP, CWE, or NIST citations. This asymmetry favors the attacker, since a defender cannot probe for "structured citations" generically. We characterize the attack across base-model scale and family, LoRA rank, and trigger string, and evaluate two complementary detection routes against a multi-seed adapter cohort. A behavioral detector built from two probe-battery statistics, outlier_gap and mean_attack_rate, separates poisoned from clean adapters perfectly when the battery overlaps the trigger's token neighborhood and at high recall with zero false positives when it does not. A weight-level statistic, the cross-module standard deviation of dimension-normalized Frobenius norms, also separates the cohort perfectly without running the model. Combined, the two routes are robust to probe composition. Causal patching localizes the backdoor to the MLP block at mid-to-late layers, with down_proj as the strongest single-projection cause. Replications across scale, family, and rank show the behavioral detector transfers without retuning, while the weight-level detector is calibration-bound to the base model. The attack scales monotonically with rank, and the chosen trigger-anchor token is both trigger-dependent and base-model-dependent. Behavioral detection is the operationally portable result for adapter supply chain scanning.
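The weight-level statistic named above can be illustrated with a minimal sketch. The abstract does not spell out the normalization, so this example assumes each module's LoRA update is the product B·A, that its Frobenius norm is divided by sqrt(d_out · d_in), and that the spread is a population standard deviation. The function names, the module names, and the plain-list matrix representation are all hypothetical choices made for illustration; they are not taken from the paper.

```python
import math
from statistics import pstdev

def matmul(B, A):
    """Multiply B (d_out x r) by A (r x d_in), both given as lists of rows."""
    return [[sum(b * a for b, a in zip(row, col)) for col in zip(*A)] for row in B]

def normalized_frobenius(A, B):
    """Frobenius norm of the LoRA update B @ A, divided by sqrt(d_out * d_in).

    ASSUMPTION: this is one plausible reading of "dimension-normalized";
    the source does not give the exact formula.
    """
    delta = matmul(B, A)
    d_out, d_in = len(delta), len(delta[0])
    fro = math.sqrt(sum(x * x for row in delta for x in row))
    return fro / math.sqrt(d_out * d_in)

def cross_module_std(adapter):
    """Population standard deviation of the normalized norms across modules.

    `adapter` maps a module name to its (A, B) pair of LoRA factors.
    """
    norms = [normalized_frobenius(A, B) for A, B in adapter.values()]
    return pstdev(norms)

# Toy adapter with two hypothetical modules of rank 1.
A = [[1.0, 1.0, 1.0, 1.0]]      # r=1, d_in=4
B_small = [[1.0], [1.0]]        # d_out=2, r=1
B_large = [[3.0], [3.0]]        # same shape, three times the magnitude
toy_adapter = {
    "layer0.mlp.down_proj": (A, B_small),
    "layer1.mlp.down_proj": (A, B_large),
}
score = cross_module_std(toy_adapter)
```

Under these assumptions, an adapter whose update mass is concentrated in a few modules yields a larger score than one whose updates are spread evenly. Any decision threshold would have to be calibrated per base model, in line with the abstract's remark that the weight-level detector is calibration-bound.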