Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection

Abstract

We show that LoRA adapters, the dominant distribution format for fine-tuned LLMs, can be reliably backdoored through training data poisoning while preserving baseline task performance. On a Qwen 2.5 1.5B prompt-injection classifier, a small fraction of poisoned examples drives a clean-accuracy-preserving backdoor to saturation. The resulting backdoor generalizes at the token feature level rather than the structural pattern level: a model trained on one RFC reference activates on any RFC reference but does not transfer to structurally identical ISO, OWASP, CWE, or NIST citations. This asymmetry favors the attacker, since a defender cannot probe for "structured citations" generically. We characterize the attack across base-model scale and family, LoRA rank, and trigger string, and evaluate two complementary detection routes against a multi-seed adapter cohort. A behavioral detector built from two probe-battery statistics, outlier_gap and mean_attack_rate, separates poisoned from clean adapters perfectly when the battery overlaps the trigger's token neighborhood, and achieves high recall with zero false positives when it does not. A weight-level statistic, the cross-module standard deviation of dimension-normalized Frobenius norms, also separates the cohort perfectly without running the model. Combined, the two routes are robust to probe composition. Causal patching localizes the backdoor to the MLP blocks at mid-to-late layers, with down_proj as the strongest single-projection cause. Replications across scale, family, and rank show that the behavioral detector transfers without retuning, while the weight-level detector must be calibrated against each base model. Attack strength scales monotonically with rank, and the token on which the backdoor anchors depends on both the trigger and the base model. Behavioral detection is therefore the operationally portable result for adapter supply-chain scanning.
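To make the weight-level statistic concrete, the following is a minimal sketch of one way it could be computed. The abstract does not give the exact definition, so several choices here are assumptions: each module's update is taken as the product B·A of its LoRA factors, its Frobenius norm is normalized by sqrt(d_out · d_in), the LoRA scaling factor is omitted, and the population standard deviation is used. The function names and the toy module names are hypothetical.

```python
import math
import statistics


def delta_frobenius_norm(A, B):
    """Frobenius norm of the LoRA update delta = B @ A.

    A is an r x d_in matrix and B is a d_out x r matrix, both given as
    lists of rows.
    """
    r, d_in, d_out = len(A), len(A[0]), len(B)
    total = 0.0
    for i in range(d_out):
        for j in range(d_in):
            entry = sum(B[i][k] * A[k][j] for k in range(r))
            total += entry * entry
    return math.sqrt(total)


def dim_normalized_norm(A, B):
    """Frobenius norm divided by sqrt(d_out * d_in).

    Assumption: this particular normalization is illustrative only; the
    source does not state which dimension normalization it uses.
    """
    d_out, d_in = len(B), len(A[0])
    return delta_frobenius_norm(A, B) / math.sqrt(d_out * d_in)


def cross_module_std(modules):
    """Population standard deviation of normalized norms across modules.

    `modules` maps a module name to its (A, B) factor pair.
    """
    norms = [dim_normalized_norm(A, B) for A, B in modules.values()]
    return statistics.pstdev(norms)


# Toy adapter with two modules of rank 2 (values are made up).
toy_adapter = {
    "layer0.down_proj": ([[1.0] * 4, [1.0] * 4], [[1.0, 1.0]] * 3),
    "layer0.q_proj": ([[2.0] * 4, [2.0] * 4], [[1.0, 1.0]] * 3),
}
print(cross_module_std(toy_adapter))
```

Because the computation reads only the adapter's factor matrices, it never runs the model, which matches the property described above. Any decision threshold on the resulting value would need to be calibrated against clean adapters for the same base model.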