ChatPaper.ai


How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models

April 13, 2026
作者: Gregory N. Frank
cs.AI

Abstract

This paper localizes the policy-routing mechanism in alignment-trained language models. An intermediate-layer attention gate reads detected content and triggers deeper amplifier heads that boost the signal toward refusal. In smaller models the gate and amplifier are single attention heads; at larger scale they become bands of heads spanning adjacent layers. The gate contributes under 1% of the output's direct logit attribution (DLA), yet interchange testing (p < 0.001) and knockout cascades confirm it is causally necessary. Interchange screening at n ≥ 120 detects the same motif in twelve models from six labs (2B to 72B parameters), though the specific heads differ by lab. Per-head ablation weakens by up to 58× at 72B and misses gates that interchange identifies; interchange is the only reliable audit at scale. Modulating the detection-layer signal controls policy continuously, from hard refusal through evasion to factual answering. On safety prompts the same intervention turns refusal into harmful guidance, showing that the safety-trained capability is gated by routing rather than removed. Thresholds vary by topic and by input language, and the circuit relocates across generations within a model family while behavioral benchmarks register no change. Routing is early-commitment: the gate commits at its own layer before deeper layers finish processing the input. Under an in-context substitution cipher, gate interchange necessity collapses by 70–99% across three models and the model switches to puzzle-solving. Injecting the plaintext gate activation into the cipher forward pass restores 48% of refusals in Phi-4-mini, localizing the bypass to the routing interface. A second method, cipher contrast analysis, uses plaintext/cipher DLA differences to map the full cipher-sensitive routing circuit in O(3n) forward passes. Any encoding that defeats detection-layer pattern matching bypasses the policy, regardless of whether deeper layers reconstruct the content.
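The interchange intervention the abstract relies on can be illustrated with a minimal sketch: run the model on a source prompt, cache the activation at a candidate gate, then re-run on a target prompt with that activation substituted in, and check whether the output flips. The toy three-stage "model" below (detection layer, threshold gate, amplifier) and all names are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of an interchange (activation-patching) test.
# The toy model and thresholds are hypothetical, chosen only to
# show the mechanics of caching and substituting an activation.

def run(layers, x, patch=None):
    """Run a toy layered model on scalar input x.

    patch: optional (layer_index, cached_activation); when given,
    that layer's output is overwritten with the cached value.
    Returns (final output, list of per-layer activations).
    """
    acts = []
    for i, layer in enumerate(layers):
        x = layer(x)
        if patch is not None and patch[0] == i:
            x = patch[1]  # interchange: substitute the cached activation
        acts.append(x)
    return x, acts

layers = [
    lambda x: x * 2.0,               # detection layer: scales the signal
    lambda x: 0.0 if x < 5 else x,   # gate: passes only above threshold
    lambda x: x + 1.0,               # amplifier: boosts whatever passes
]

# A "flagged" input drives the gate above threshold; a benign one does not.
flagged_out, flagged_acts = run(layers, 4.0)
benign_out, _ = run(layers, 1.0)

# Interchange: patch the flagged run's gate activation into the benign run.
# If the benign output now matches the flagged one, the gate is causally
# sufficient to route the outcome, despite its small direct contribution.
patched_out, _ = run(layers, 1.0, patch=(1, flagged_acts[1]))
print(flagged_out, benign_out, patched_out)  # → 9.0 1.0 9.0
```

The knockout direction is the same operation with a zeroed activation patched into the flagged run; the paper's scale results suggest that at 72B this substitution must cover a band of heads rather than a single one.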
PDF · April 15, 2026