Targeted Neuron Modulation via Contrastive Pair Search

Abstract

Language models are instruction-tuned to refuse harmful requests, but the mechanisms underlying this behavior remain poorly understood. Popular steering methods operate on the residual stream and degrade output coherence at high intervention strengths, limiting their practical use. We introduce contrastive neuron attribution (CNA), which identifies the 0.1% of MLP neurons whose activations most distinguish harmful from benign prompts, requiring only forward passes with no gradients or auxiliary training. In instruct models, ablating the discovered circuit reduces refusal rates by over 50% on a standard jailbreak benchmark while preserving fluency and non-degeneracy across all steering strengths. Applying CNA to matched base and instruct models across Llama and Qwen architectures (from 1B to 72B parameters), we find that base models contain similar late-layer discrimination structures, but that steering these neurons produces only content shifts, not behavioral change. These results demonstrate that neuron-level intervention enables reliable behavioral steering without the quality tradeoffs of residual-stream methods. More broadly, our findings suggest that alignment fine-tuning transforms pre-existing discrimination structure into a sparse, targetable refusal gate.
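
To make the selection step concrete, here is a minimal sketch of how a forward-pass-only contrastive selection might look. It is not the paper's implementation. The abstract does not state the scoring rule, so the score used here (absolute difference of per-neuron mean activations between harmful and benign prompts) is an assumption, as are the function names `select_contrastive_neurons` and `ablate`, the flattened neuron indexing, and the synthetic activations standing in for recorded MLP activations.

```python
import numpy as np

def select_contrastive_neurons(harmful_acts, benign_acts, fraction=0.001):
    """Return sorted indices of the top `fraction` of neurons by contrast score.

    Both inputs have shape (n_prompts, n_neurons), with MLP neurons from all
    layers flattened into one axis (an assumption made for this sketch).
    """
    # Assumed score: absolute difference of per-neuron mean activations.
    score = np.abs(harmful_acts.mean(axis=0) - benign_acts.mean(axis=0))
    k = max(1, int(round(fraction * score.size)))
    # argpartition finds the k largest scores without a full sort.
    top = np.argpartition(score, -k)[-k:]
    return np.sort(top)

def ablate(acts, neuron_idx, strength=1.0):
    """Scale the selected neurons by (1 - strength); strength=1.0 zeroes them."""
    out = acts.copy()
    out[:, neuron_idx] *= (1.0 - strength)
    return out

# Synthetic stand-in for recorded activations (hypothetical data).
rng = np.random.default_rng(0)
n_prompts, n_neurons = 64, 8000
benign = rng.normal(size=(n_prompts, n_neurons))
harmful = rng.normal(size=(n_prompts, n_neurons))

# Plant 8 neurons (0.1% of 8000) that respond more strongly to harmful prompts.
planted = np.array([10, 123, 500, 3000, 4242, 6500, 7000, 7999])
harmful[:, planted] += 5.0

selected = select_contrastive_neurons(harmful, benign, fraction=0.001)
ablated = ablate(harmful, selected)
```

In a real model the two activation matrices would be collected with forward hooks on the MLP layers, and the ablation would be applied inside the forward pass, not to a stored array. The `strength` parameter is only a stand-in for the steering strengths mentioned in the abstract.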