How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models

Abstract

This paper localizes the policy-routing mechanism in alignment-trained language models. An attention gate in the intermediate layers reads detected content and triggers deeper amplifier heads that boost the signal toward refusal. In smaller models the gate and the amplifier are single heads; at larger scale they become bands of heads spanning adjacent layers. The gate contributes less than 1% of the output direct logit attribution (DLA), yet interchange testing (p < 0.001) and a knockout cascade confirm that it is causally necessary. Interchange screening at n >= 120 detects the same motif in twelve models from six labs (2B to 72B), although the specific heads differ by lab. The measured effect of per-head ablation weakens by up to 58x at 72B, and ablation misses gates that interchange identifies, so interchange is the only reliable audit method at scale. Modulating the detection-layer signal controls policy continuously, from hard refusal through evasion to factual answering. On safety prompts the same intervention turns refusal into harmful guidance, showing that the safety-trained capability is gated by routing rather than removed. Thresholds vary by topic and by input language, and the circuit relocates across generations within a model family while behavioral benchmarks register no change. Routing is early-commitment: the gate commits at its own layer, before deeper layers finish processing the input. Under an in-context substitution cipher, gate interchange necessity collapses by 70 to 99% across three models, and the models switch to puzzle-solving. Injecting the plaintext gate activation into the cipher forward pass restores 48% of refusals in Phi-4-mini, which localizes the bypass to the routing interface. A second method, cipher contrast analysis, uses plaintext/cipher DLA differences to map the full cipher-sensitive routing circuit in O(3n) forward passes. Any encoding that defeats detection-layer pattern matching bypasses the policy, regardless of whether deeper layers reconstruct the content.
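The contrast between a small DLA share and high interchange necessity can be illustrated with a toy sketch. This is not the paper's model or code: the scalar "gate" and "amplifier", the weights `GATE_DIRECT_WEIGHT` and `AMP_GAIN`, and the helper names are all hypothetical, chosen only so that the gate writes almost nothing to the refusal logit directly while the amplifier depends entirely on it.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical toy constants, not values from the paper.
GATE_DIRECT_WEIGHT = 0.01  # gate's own direct write onto the refusal logit
AMP_GAIN = 5.0             # how strongly the amplifier boosts the gate signal


@dataclass
class Run:
    gate: float
    amplifier: float
    refusal_logit: float


def forward(detected: float, gate_override: Optional[float] = None) -> Run:
    """Toy forward pass: the gate reads the detection signal and the
    amplifier reads the gate. gate_override patches the gate activation."""
    gate = detected if gate_override is None else gate_override
    amplifier = AMP_GAIN * gate
    refusal_logit = GATE_DIRECT_WEIGHT * gate + amplifier
    return Run(gate, amplifier, refusal_logit)


def gate_dla_share(run: Run) -> float:
    """Fraction of the refusal logit written directly by the gate."""
    return GATE_DIRECT_WEIGHT * run.gate / run.refusal_logit


def interchange_necessity(donor_detected: float, base_detected: float) -> float:
    """Fraction of the base refusal logit lost when the gate activation
    is swapped in from a donor run (an interchange intervention)."""
    base = forward(base_detected)
    donor = forward(donor_detected)
    patched = forward(base_detected, gate_override=donor.gate)
    return (base.refusal_logit - patched.refusal_logit) / base.refusal_logit


if __name__ == "__main__":
    flagged = forward(1.0)
    print("gate DLA share:", gate_dla_share(flagged))
    print("interchange necessity:", interchange_necessity(0.0, 1.0))
```

In this toy setting the gate accounts for well under 1% of the direct attribution, yet swapping in the gate activation from an unflagged input removes the entire refusal logit, because the amplifier's contribution flows through the gate. Attribution restricted to direct writes would rank the gate as negligible, while the interchange test would not.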
