Targeted Neural Modulation via Contrastive Pair Search

Abstract

Language models are instruction-tuned to refuse harmful requests, but the mechanisms underlying this behavior remain poorly understood. Popular steering methods operate on the residual stream and degrade output coherence at high intervention strengths, limiting their practical use. We introduce contrastive neuron attribution (CNA), which identifies the 0.1% of MLP neurons whose activations most distinguish harmful from benign prompts, requiring only forward passes with no gradients or auxiliary training. In instruct models, ablating the discovered circuit reduces refusal rates by over 50% on a standard jailbreak benchmark while preserving fluency and non-degeneracy across all steering strengths. Applying CNA to matched base and instruct models across Llama and Qwen architectures (from 1B to 72B parameters), we find that base models contain similar late-layer discrimination structures but steering these neurons produces only content shifts, not behavioral change. These results demonstrate that neuron-level intervention enables reliable behavioral steering without the quality tradeoffs of residual-stream methods. More broadly, our findings suggest that alignment fine-tuning transforms pre-existing discrimination structure into a sparse, targetable refusal gate.
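
The selection step described above can be illustrated with a minimal sketch. The abstract does not give the exact attribution score, so this example assumes the simplest contrastive statistic: the per-neuron difference in mean activation between harmful and benign prompts, ranked by absolute value. The function names (`contrastive_scores`, `top_fraction`, `ablate`), the zero-ablation choice, and the synthetic activations are all hypothetical and stand in for MLP activations that would be recorded during forward passes of a real model.

```python
import random
from statistics import fmean


def contrastive_scores(harmful_acts, benign_acts):
    """Per-neuron mean activation difference (harmful minus benign).

    Each argument is a list of activation vectors, one per prompt.
    Assumption: a plain difference of means is used as the score.
    """
    n_neurons = len(harmful_acts[0])
    return [
        fmean(row[j] for row in harmful_acts)
        - fmean(row[j] for row in benign_acts)
        for j in range(n_neurons)
    ]


def top_fraction(scores, fraction=0.001):
    """Indices of the top `fraction` of neurons by absolute score."""
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda j: abs(scores[j]), reverse=True)
    return sorted(ranked[:k])


def ablate(activations, neuron_ids):
    """Return a copy of one activation vector with the chosen neurons zeroed.

    Assumption: ablation is modeled as setting the activation to zero.
    """
    out = list(activations)
    for j in neuron_ids:
        out[j] = 0.0
    return out


# Synthetic stand-in for recorded MLP activations: 2000 neurons, two of
# which (hypothetical indices 17 and 1203) fire strongly on harmful prompts.
rng = random.Random(0)
N_NEURONS = 2000
PLANTED = {17, 1203}


def fake_activations(harmful):
    return [
        rng.gauss(0.0, 1.0) + (5.0 if harmful and j in PLANTED else 0.0)
        for j in range(N_NEURONS)
    ]


harmful_acts = [fake_activations(True) for _ in range(20)]
benign_acts = [fake_activations(False) for _ in range(20)]

scores = contrastive_scores(harmful_acts, benign_acts)
selected = top_fraction(scores, fraction=0.001)  # 0.1% of 2000 neurons = 2
ablated = ablate(harmful_acts[0], selected)
```

Because the scores come from averaged activations alone, the procedure needs only forward passes, which matches the no-gradient, no-auxiliary-training property claimed above. In a real model the activation vectors would be collected from the MLP layers and the intervention applied during generation.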