Language-Switching Triggers Take a Latent Detour Through Language Models

Abstract

Backdoor attacks on language models are a growing security concern, yet the internal mechanisms by which a trigger sequence hijacks a model's computation remain poorly understood. We identify the circuit underlying a language-switching backdoor in an 8B-parameter autoregressive language model, in which a three-word Latin trigger (nine tokens) redirects English output to French. We decompose the circuit into three phases: (1) distributed attention heads in the early layers compose the trigger tokens into the last sequence position; (2) the resulting signal propagates through the middle layers in a subspace orthogonal to the model's natural language-identity direction; (3) the MLP at the final layer converts this latent signal into French logits. The entire circuit passes through a serial bottleneck at a single position: corrupting that position at any layer fully disables the trigger but also degrades the model's capabilities. The orthogonal latent encoding suggests that defenses that search for language-like signals in intermediate representations would miss this trigger entirely.
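
The last claim can be illustrated with a toy sketch, under stated assumptions: the activations are synthetic, the hidden size is arbitrary, and the names `lang_direction`, `trigger_signal`, and `language_score` are hypothetical and not taken from the paper. The sketch only shows why a detector that projects activations onto a language-identity direction registers no change when the added signal is orthogonal to that direction, even though the activation itself moves substantially.

```python
import numpy as np


def split_along_direction(signal, direction):
    """Split `signal` into components parallel and orthogonal to `direction`."""
    unit = direction / np.linalg.norm(direction)
    parallel = np.dot(signal, unit) * unit
    return parallel, signal - parallel


def language_score(activation, direction):
    """Toy detector: scalar projection of an activation onto a direction."""
    return float(np.dot(activation, direction / np.linalg.norm(direction)))


rng = np.random.default_rng(0)
d_model = 64  # toy hidden size (assumption, not the real model's width)

# Hypothetical language-identity direction, e.g. a French-minus-English mean difference.
lang_direction = rng.normal(size=d_model)

# Hypothetical trigger signal, constructed here to be orthogonal to lang_direction.
raw = rng.normal(size=d_model)
_, trigger_signal = split_along_direction(raw, lang_direction)

# Synthetic residual-stream activation with and without the trigger signal added.
clean = rng.normal(size=d_model)
triggered = clean + 5.0 * trigger_signal

# The detector's score barely moves, while the activation itself moves a lot.
score_shift = language_score(triggered, lang_direction) - language_score(clean, lang_direction)
norm_shift = float(np.linalg.norm(triggered - clean))

print(f"shift in language score: {score_shift:.2e}")
print(f"shift in activation norm: {norm_shift:.2f}")
```

In this toy setting the projection-based score is unchanged up to floating-point error, so any threshold on it would treat the clean and triggered activations identically. Whether a real defense behaves this way depends on how it extracts language-like signals, which the sketch does not model.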