Language-Switching Triggers Take a Latent Detour Through Language Models

May 18, 2026
Authors: Francis Kulumba, Wissam Antoun, Théo Lasnier, Benoît Sagot, Djamé Seddah
cs.AI

Abstract

Backdoor attacks on language models pose a growing security concern, yet the internal mechanisms by which a trigger sequence hijacks model computation remain poorly understood. We identify the circuit underlying a language-switching backdoor in an 8B-parameter autoregressive language model, in which a three-word Latin trigger (nine tokens) redirects English output to French. We decompose the circuit into three phases: (1) distributed attention heads in early layers compose the trigger tokens into the last sequence position; (2) the resulting signal propagates through the middle layers in a subspace orthogonal to the model's natural language-identity direction; (3) the MLP at the final layer converts this latent signal into French logits. The entire circuit flows through a serial bottleneck at a single position: corrupting that position at any layer entirely mitigates the trigger but also hinders the model's capabilities. The orthogonal latent encoding suggests that defenses that search for language-like signals in intermediate representations would miss this trigger entirely.
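The orthogonality claim in phase (2) can be illustrated with a toy calculation. The sketch below is not the authors' code or data: the hidden size, the random vectors, and the names lang_dir, trigger_signal, and language_probe are all hypothetical, and the "defense" is assumed to be a simple linear probe that projects a hidden state onto a language-identity direction. Under those assumptions, a signal placed in the orthogonal complement of that direction changes the hidden state substantially while leaving the probe reading essentially unchanged.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale v to unit Euclidean norm."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def remove_component(v, unit_dir):
    """Return v minus its projection onto unit_dir (one Gram-Schmidt step)."""
    coeff = dot(v, unit_dir)
    return [a - coeff * b for a, b in zip(v, unit_dir)]


rng = random.Random(0)
dim = 64  # toy hidden size (assumption), far smaller than a real 8B model

# Hypothetical language-identity direction, e.g. a French-minus-English
# difference of mean activations. Here it is just a random unit vector.
lang_dir = normalize([rng.gauss(0, 1) for _ in range(dim)])

# Hypothetical trigger signal, constructed to lie in the subspace
# orthogonal to lang_dir.
trigger_signal = normalize(
    remove_component([rng.gauss(0, 1) for _ in range(dim)], lang_dir)
)

# A clean hidden state and the same state with the trigger signal added.
clean_hidden = [rng.gauss(0, 1) for _ in range(dim)]
triggered_hidden = [h + 5.0 * t for h, t in zip(clean_hidden, trigger_signal)]


def language_probe(hidden):
    """Assumed defense: read off the component along lang_dir."""
    return dot(hidden, lang_dir)


# How much the probe reading moves versus how much the hidden state moves.
probe_shift = language_probe(triggered_hidden) - language_probe(clean_hidden)
hidden_shift = math.sqrt(
    sum((a - b) ** 2 for a, b in zip(triggered_hidden, clean_hidden))
)

print(f"probe shift:  {probe_shift:.2e}")
print(f"hidden shift: {hidden_shift:.2f}")
```

In this toy setting the probe shift is zero up to floating-point error while the hidden state moves by a norm of about 5, which is the sense in which a detector looking only along the language-identity direction could miss such a signal.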