Targeted Neuron Modulation via Contrastive Pair Search

Abstract

Language models are instruction-tuned to refuse harmful requests, but the mechanisms underlying this behavior remain poorly understood. Popular steering methods operate on the residual stream and degrade output coherence at high intervention strengths, limiting their practical use. We introduce contrastive neuron attribution (CNA), which identifies the 0.1% of MLP neurons whose activations most distinguish harmful from benign prompts, requiring only forward passes with no gradients or auxiliary training. In instruct models, ablating the discovered circuit reduces refusal rates by over 50% on a standard jailbreak benchmark while preserving fluency and non-degeneracy across all steering strengths. Applying CNA to matched base and instruct models across Llama and Qwen architectures (from 1B to 72B parameters), we find that base models contain similar late-layer discrimination structure, but steering these neurons produces only content shifts, not behavioral change. These results demonstrate that neuron-level intervention enables reliable behavioral steering without the quality tradeoffs of residual-stream methods. More broadly, our findings suggest that alignment fine-tuning transforms pre-existing discrimination structure into a sparse, targetable refusal gate.
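To make the selection and ablation steps concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not give the exact attribution score, so the mean activation difference between harmful and benign prompts is used here purely as an illustrative assumption. The function names, the `strength` parameter, and the toy activation values are likewise hypothetical. In practice the activations would be MLP neuron activations recorded during forward passes over the two prompt sets.

```python
from statistics import mean


def contrastive_scores(harmful_acts, benign_acts):
    """Per-neuron score: mean activation on harmful prompts minus mean on benign.

    Each argument is a list of per-prompt activation vectors of equal length.
    ASSUMPTION: the mean difference is an illustrative score, not necessarily
    the attribution score used in the paper.
    """
    n_neurons = len(harmful_acts[0])
    return [
        mean(row[j] for row in harmful_acts) - mean(row[j] for row in benign_acts)
        for j in range(n_neurons)
    ]


def select_top_neurons(scores, fraction=0.001):
    """Return sorted indices of the top `fraction` of neurons by |score|.

    At least one neuron is always returned.
    """
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda j: abs(scores[j]), reverse=True)
    return sorted(ranked[:k])


def ablate(activations, neuron_ids, strength=1.0):
    """Scale the selected neurons by (1 - strength); strength=1.0 zeroes them.

    ASSUMPTION: `strength` is a hypothetical knob standing in for the
    steering strength mentioned in the abstract.
    """
    out = list(activations)
    for j in neuron_ids:
        out[j] = out[j] * (1.0 - strength)
    return out


# Toy data: 2 prompts per class, 4 neurons (values are made up).
harmful = [[0.1, 2.0, 0.3, 0.0], [0.2, 1.8, 0.1, 0.1]]
benign = [[0.1, 0.1, 0.2, 0.0], [0.3, 0.0, 0.2, 0.1]]

scores = contrastive_scores(harmful, benign)
selected = select_top_neurons(scores, fraction=0.25)  # 1 of 4 neurons
ablated = ablate(harmful[1], selected)                # zero the selected neuron
```

Both steps need only recorded activations, which matches the abstract's claim that the method requires forward passes without gradients or auxiliary training. The toy example uses `fraction=0.25` only because it has four neurons; the abstract's figure is 0.1%.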