Language-Switching Triggers Take a Latent Detour Through Language Models

Abstract

Backdoor attacks on language models pose a growing security concern, yet the internal mechanisms by which a trigger sequence hijacks model computation remain poorly understood. We identify a circuit underlying a language-switching backdoor in an 8B-parameter autoregressive language model, where a three-word Latin trigger (nine tokens) redirects English output to French. We decompose the circuit into three phases: (1) distributed attention heads at early layers compose information from the trigger tokens into the last sequence position; (2) the resulting signal propagates through the middle layers in a subspace orthogonal to the model's natural language-identity direction; (3) the MLP at the final layer converts this latent signal into French logits. The entire circuit flows through a serial bottleneck at a single position: corrupting that position at any layer fully neutralizes the trigger, but also degrades the model's capabilities. The orthogonal latent encoding suggests that defenses that search for language-like signals in intermediate representations would miss this trigger entirely.
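To illustrate why an orthogonal encoding can evade a language-direction probe, the following toy sketch may help. It uses synthetic NumPy vectors rather than real model activations. The names `lang_dir`, `trigger_delta`, and the dimension `d_model = 64` are hypothetical and chosen only for illustration, and the trigger signal is made orthogonal by construction, so this demonstrates the geometric argument rather than reproducing the paper's measurements.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def remove_component(v, direction):
    """Return v with its component along `direction` projected out."""
    d = direction / np.linalg.norm(direction)
    return v - np.dot(v, d) * d


rng = np.random.default_rng(0)
d_model = 64  # toy residual-stream width (assumption, not the real model's)

# Hypothetical language-identity direction (e.g. an English-vs-French axis).
lang_dir = rng.normal(size=d_model)
unit_lang = lang_dir / np.linalg.norm(lang_dir)

# Toy trigger signal, constructed here to be orthogonal to lang_dir.
trigger_delta = remove_component(rng.normal(size=d_model), lang_dir)

# Residual stream at the last position, without and with the trigger signal.
clean_resid = rng.normal(size=d_model)
triggered_resid = clean_resid + 5.0 * trigger_delta

# A linear probe that reads only the language-identity direction.
probe_clean = float(clean_resid @ unit_lang)
probe_triggered = float(triggered_resid @ unit_lang)
probe_shift = probe_triggered - probe_clean

# How far the residual stream actually moved.
resid_shift = float(np.linalg.norm(triggered_resid - clean_resid))

print("cosine(trigger, language direction):", cosine(trigger_delta, lang_dir))
print("shift seen by the language probe:  ", probe_shift)
print("actual shift in the residual stream:", resid_shift)
```

In this toy setting the probe reading barely changes even though the residual stream moves substantially, which is the sense in which a defense restricted to language-like directions could miss the signal.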