Language-Switching Triggers Take a Latent Detour in Language Models

Abstract

Backdoor attacks on language models are a growing security concern, yet the internal mechanisms by which a trigger sequence hijacks a model's computation remain poorly understood. We identify the circuit underlying a language-switching backdoor in an 8B-parameter autoregressive language model, in which a three-word Latin trigger (nine tokens) redirects English output to French. We decompose the circuit into three phases: (1) distributed attention heads in early layers aggregate the trigger tokens into the final sequence position; (2) the resulting signal propagates through the middle layers in a subspace orthogonal to the model's natural language-identity direction; (3) the MLP at the final layer converts this latent signal into French logits. The entire circuit flows through a serial bottleneck at a single position: corrupting that position at any layer fully suppresses the trigger, but it also degrades the model's capabilities. The orthogonal latent encoding suggests that defenses that search for language-like signals in intermediate representations would miss this trigger entirely.