How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models

Abstract

This paper localizes the policy routing mechanism in alignment-trained language models. An intermediate-layer attention gate reads detected content and triggers deeper amplifier heads that boost the signal toward refusal. In smaller models the gate and the amplifier are single heads; at larger scale they become bands of heads spanning adjacent layers. The gate contributes under 1% of output direct logit attribution (DLA), yet interchange testing (p<0.001) and a knockout cascade confirm that it is causally necessary. Interchange screening at n>=120 detects the same motif in twelve models from six labs (2B to 72B), although the specific heads differ by lab. The effect of per-head ablation weakens by up to 58x at 72B, and ablation misses gates that interchange identifies, so interchange is the only reliable audit at scale. Modulating the detection-layer signal controls policy continuously, from hard refusal through evasion to factual answering. On safety prompts the same intervention turns refusal into harmful guidance, which shows that the safety-trained capability is gated by routing rather than removed. Thresholds vary by topic and by input language, and the circuit relocates across generations within a model family while behavioral benchmarks register no change. Routing is early-commitment: the gate commits at its own layer, before deeper layers finish processing the input. Under an in-context substitution cipher, gate interchange necessity collapses by 70 to 99% across three models and the model switches to puzzle-solving. Injecting the plaintext gate activation into the cipher forward pass restores 48% of refusals in Phi-4-mini, which localizes the bypass to the routing interface. A second method, cipher contrast analysis, uses plain/cipher DLA differences to map the full cipher-sensitive routing circuit in O(3n) forward passes. Any encoding that defeats detection-layer pattern matching bypasses the policy, regardless of whether deeper layers reconstruct the content.
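The contrast between a small DLA share and full causal necessity can be illustrated with a toy sketch. It is not the paper's model or code: the two-component "gate" and "amplifier", the constants `GATE_DIRECT` and `AMP_GAIN`, and the function names are all hypothetical, chosen only to show how a component that writes almost nothing directly to the refusal logit can still be necessary under an interchange (activation-swap) test.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical constants for the toy; not taken from the paper.
GATE_DIRECT = 0.01  # gate's direct write onto the refusal direction
AMP_GAIN = 5.0      # amplifier gain applied to the gate signal


@dataclass
class Run:
    gate_out: Tuple[float, float]  # (write to gate channel, direct write to refusal)
    amp_out: float                 # amplifier's write to refusal
    refusal_logit: float           # sum of all direct writes to refusal


def forward(content: float, patched_gate: Optional[Tuple[float, float]] = None) -> Run:
    """Toy forward pass; `content` is the detection-layer feature (1.0 = flagged)."""
    if patched_gate is None:
        gate_out = (content, GATE_DIRECT * content)
    else:
        gate_out = patched_gate  # interchange: use the gate output from another run
    gate_signal, gate_direct = gate_out
    amp_out = AMP_GAIN * gate_signal  # the amplifier reads only the gate channel
    return Run(gate_out, amp_out, gate_direct + amp_out)


def gate_dla_share(run: Run) -> float:
    """Fraction of the refusal logit written directly by the gate."""
    return run.gate_out[1] / run.refusal_logit


def interchange_necessity(base_content: float, source_content: float) -> float:
    """Relative drop in refusal logit when the gate output is swapped in from the source run."""
    base = forward(base_content)
    source = forward(source_content)
    patched = forward(base_content, patched_gate=source.gate_out)
    return (base.refusal_logit - patched.refusal_logit) / base.refusal_logit


flagged = forward(1.0)
gate_share = gate_dla_share(flagged)
necessity = interchange_necessity(base_content=1.0, source_content=0.0)
print(f"gate DLA share: {gate_share:.4f}")
print(f"interchange necessity: {necessity:.2f}")
```

In this toy the gate's direct share of the refusal logit is about 0.2%, while swapping in the gate output from a neutral run removes the refusal logit entirely, because the amplifier's contribution depends on the gate channel. Real models are far less clean than this sketch, which is only meant to make the distinction between attribution and causal necessity concrete.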

