Language-Switching Trigger Signals Take a Latent Detour Through Language Models

Abstract

Backdoor attacks on language models pose a growing security concern, yet the internal mechanisms by which a trigger sequence hijacks a model's computation remain poorly understood. We identify a circuit underlying a language-switching backdoor in an 8B-parameter autoregressive language model, where a three-word Latin trigger (nine tokens) redirects English output to French. We decompose the circuit into three phases: (1) distributed attention heads at early layers compose the trigger tokens into the last sequence position; (2) the resulting signal propagates through the middle layers in a subspace orthogonal to the model's natural language-identity direction; (3) the MLP at the final layer converts this latent signal into French logits. The entire circuit flows through a serial bottleneck at a single position: corrupting that position at any layer entirely eliminates the trigger's effect, but it also degrades the model's capabilities. The orthogonal latent encoding suggests that defenses that search for language-like signals in intermediate representations would miss this trigger entirely.
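
To make the orthogonality claim in phase (2) concrete, the following is a minimal sketch of one way such a check could be expressed: measure what fraction of a signal's squared norm lies along a reference direction. It is not the procedure used in this work. All vectors are synthetic, and the names `fraction_along`, `lang_dir`, `trigger_signal`, and `natural_signal`, as well as the hidden size of 4096, are assumptions introduced for illustration only.

```python
import numpy as np


def fraction_along(signal: np.ndarray, direction: np.ndarray) -> float:
    """Fraction of the signal's squared norm that lies along `direction`."""
    unit = direction / np.linalg.norm(direction)
    projection = np.dot(signal, unit)
    return float(projection**2 / np.dot(signal, signal))


rng = np.random.default_rng(0)
d_model = 4096  # assumed hidden size, for illustration only

# Hypothetical language-identity direction (synthetic here). In practice one
# might estimate it as a difference of mean activations between French and
# English prompts; that choice is an assumption, not taken from the source.
lang_dir = rng.normal(size=d_model)
unit = lang_dir / np.linalg.norm(lang_dir)

# Synthetic trigger signal, constructed to be orthogonal to lang_dir by
# removing its component along that direction.
raw = rng.normal(size=d_model)
trigger_signal = raw - np.dot(raw, unit) * unit

# Synthetic "natural" language signal: mostly along lang_dir, plus small noise.
natural_signal = 5.0 * unit + 0.01 * rng.normal(size=d_model)

trigger_frac = fraction_along(trigger_signal, lang_dir)
natural_frac = fraction_along(natural_signal, lang_dir)
print(f"trigger: {trigger_frac:.3e}, natural: {natural_frac:.3f}")
```

In this toy setting, a detector that thresholds the projection onto `lang_dir` would flag the natural signal but not the trigger signal, which is the failure mode the abstract describes for defenses that look for language-like signals in intermediate representations.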