Targeted Neuron Modulation via Contrastive Pair Search

Abstract

Language models are instruction-tuned to refuse harmful requests, but the mechanisms underlying this behavior remain poorly understood. Popular steering methods operate on the residual stream and degrade output coherence at high intervention strengths, limiting their practical use. We introduce contrastive neuron attribution (CNA), which identifies the 0.1% of MLP neurons whose activations most distinguish harmful from benign prompts, requiring only forward passes with no gradients or auxiliary training. In instruct models, ablating the discovered circuit reduces refusal rates by over 50% on a standard jailbreak benchmark while preserving fluency and non-degeneracy across all steering strengths. Applying CNA to matched base and instruct models across Llama and Qwen architectures (from 1B to 72B parameters), we find that base models contain similar late-layer discrimination structures, but steering these neurons produces only content shifts, not behavioral change. These results demonstrate that neuron-level intervention enables reliable behavioral steering without the quality tradeoffs of residual-stream methods. More broadly, our findings suggest that alignment fine-tuning transforms pre-existing discrimination structure into a sparse, targetable refusal gate.
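
The selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the scoring rule, so the code below assumes the simplest contrastive score: the absolute difference between each neuron's mean activation on harmful prompts and on benign prompts. The function names (`contrastive_scores`, `top_fraction`, `scale_neurons`), the toy activations, and the idea of ablating by multiplying with a factor `alpha` are all illustrative assumptions, not the authors' implementation. In practice the activations would come from forward passes through a real model's MLP layers.

```python
from statistics import fmean


def contrastive_scores(harmful_acts, benign_acts):
    """Score each neuron by |mean harmful activation - mean benign activation|.

    Both arguments are lists of rows, one row of per-neuron activations
    per prompt. The mean-difference score is an assumption of this sketch.
    """
    n_neurons = len(harmful_acts[0])
    scores = []
    for j in range(n_neurons):
        mean_h = fmean([row[j] for row in harmful_acts])
        mean_b = fmean([row[j] for row in benign_acts])
        scores.append(abs(mean_h - mean_b))
    return scores


def top_fraction(scores, fraction=0.001):
    """Return the sorted indices of the highest-scoring `fraction` of neurons."""
    k = max(1, round(fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def scale_neurons(acts, selected, alpha=0.0):
    """Multiply the selected neurons by alpha (alpha=0.0 ablates them)."""
    chosen = set(selected)
    return [a * alpha if j in chosen else a for j, a in enumerate(acts)]


# Toy data (hypothetical): 2000 neurons, two of which respond to harmful prompts.
N_NEURONS = 2000
harmful_row = [0.0] * N_NEURONS
harmful_row[7] = 1.0
harmful_row[1234] = -2.0
harmful = [list(harmful_row) for _ in range(3)]
benign = [[0.0] * N_NEURONS for _ in range(3)]

selected = top_fraction(contrastive_scores(harmful, benign), fraction=0.001)
ablated = scale_neurons(harmful[0], selected, alpha=0.0)
print(selected)  # [7, 1234]
```

Because the scores come from activation statistics alone, the sketch needs no gradients or auxiliary training, matching the forward-pass-only property the abstract claims. Values of `alpha` other than zero would correspond to different steering strengths under this sketch's assumptions.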