NeST: Neuron Selective Tuning for LLM Safety
February 18, 2026
Authors: Sasha Behrouzi, Lichao Wu, Mohamadreza Rostami, Ahmad-Reza Sadeghi
cs.AI
Abstract
Safety alignment is essential for the responsible deployment of large language models (LLMs). Yet, existing approaches often rely on heavyweight fine-tuning that is costly to update, audit, and maintain across model families. Full fine-tuning incurs substantial computational and storage overhead, while parameter-efficient methods such as LoRA trade efficiency for inconsistent safety gains and sensitivity to design choices. Safety intervention mechanisms such as circuit breakers reduce unsafe outputs without modifying model weights, but do not directly shape or preserve the internal representations that govern safety behavior. These limitations hinder rapid and reliable safety updates, particularly in settings where models evolve frequently or must adapt to new policies and domains.
We present NeST, a lightweight, structure-aware safety alignment framework that strengthens refusal behavior by selectively adapting a small subset of safety-relevant neurons while freezing the remainder of the model. NeST aligns parameter updates with the internal organization of safety behavior by clustering functionally coherent safety neurons and enforcing shared updates within each cluster, enabling targeted and stable safety adaptation without broad model modification or inference-time overhead. We benchmark NeST against three dominant baselines (full fine-tuning, LoRA-based fine-tuning, and circuit breakers) across 10 open-weight LLMs spanning multiple model families and sizes. Across all evaluated models, NeST reduces the attack success rate from an average of 44.5% to 4.36%, corresponding to a 90.2% reduction in unsafe generations, while requiring only 0.44 million trainable parameters on average. This amounts to a 17,310x decrease in updated parameters compared to full fine-tuning and a 9.25x reduction relative to LoRA, while consistently achieving stronger safety alignment.
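To make the core mechanism concrete, the sketch below illustrates the idea the abstract describes: a small set of safety-relevant neurons is grouped into clusters, each cluster shares one trainable update, and every other weight stays frozen. This is a minimal PyTorch illustration, not the paper's actual parameterization; the additive row-level update on a linear layer, and the names `ClusterSharedNeuronDelta`, `clusters`, and `deltas`, are all assumptions made for this example.

```python
import torch
import torch.nn as nn

class ClusterSharedNeuronDelta(nn.Module):
    """Illustrative neuron-selective tuning of one linear layer.

    The base layer is frozen; each cluster of safety-relevant output
    neurons (rows of the weight matrix) shares a single trainable
    update vector. How NeST actually parameterizes updates may differ.
    """

    def __init__(self, base: nn.Linear, clusters: dict[int, list[int]]):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze the original model weights
            p.requires_grad_(False)
        self.clusters = clusters
        # One shared update vector per cluster, same shape as a weight row.
        self.deltas = nn.ParameterDict({
            str(cid): nn.Parameter(torch.zeros(base.in_features))
            for cid in clusters
        })

    def effective_weight(self) -> torch.Tensor:
        # Apply each cluster's shared delta to all of its member rows.
        w = self.base.weight.clone()
        for cid, rows in self.clusters.items():
            w[rows] = w[rows] + self.deltas[str(cid)]
        return w

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return nn.functional.linear(x, self.effective_weight(), self.base.bias)

# Usage: only the per-cluster deltas are trainable (hypothetical indices).
layer = ClusterSharedNeuronDelta(
    nn.Linear(4096, 11008),
    clusters={0: [7, 42, 311], 1: [1024, 2048]},
)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 clusters x 4096 = 8192 trainable parameters
```

In this toy setup the trainable parameter count scales with the number of clusters rather than the layer size, which is consistent with the abstract's claim of roughly 0.44 million trainable parameters per model versus billions for full fine-tuning.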