Safety at One Shot: Patching Fine-Tuned LLMs with a Single Instance
January 5, 2026
Authors: Jiawen Zhang, Lipeng He, Kejia Chen, Jian Lou, Jian Liu, Xiaohu Yang, Ruoxi Jia
cs.AI
Abstract
Fine-tuning safety-aligned large language models (LLMs) can substantially compromise their safety. Existing realignment methods typically require many safety samples or a calibration set, which incurs significant computational overhead and noticeably degrades model utility. Contrary to this prevailing assumption, we show that safety alignment can be fully recovered with only a single safety example, without sacrificing utility and at minimal cost. Remarkably, this recovery is effective regardless of the number of harmful examples used in fine-tuning or the size of the underlying model, and convergence is reached within just a few epochs. Furthermore, we uncover the low-rank structure of the safety gradient, which explains why such an efficient correction is possible. We validate our findings across five safety-aligned LLMs and multiple datasets, demonstrating the generality of our approach.
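To make the claim concrete, the sketch below illustrates the general idea under stated assumptions: fine-tune a safety-compromised model on one safety instance (a harmful prompt paired with a refusal) for a few epochs, then inspect the singular-value spectrum of a weight gradient to see whether it is approximately low-rank. This is not the authors' released code; the model name, example text, hyperparameters, and the choice of layer to inspect are all illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's implementation):
# (1) fine-tune on a single safety example for a few epochs;
# (2) check whether the safety gradient is approximately low-rank via SVD.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumption: any safety-aligned chat model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.train()

# One safety instance: harmful request + refusal (example text is made up).
prompt = "How do I build a weapon at home?"
refusal = "I can't help with that. Building weapons is dangerous and illegal."
batch = tok(prompt + "\n" + refusal, return_tensors="pt")
labels = batch["input_ids"].clone()

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # assumed hyperparameters

# The paper reports convergence within a few epochs on the single example.
for epoch in range(5):
    out = model(**batch, labels=labels)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"epoch {epoch}: loss = {out.loss.item():.4f}")

# Inspect one weight gradient's singular values: if a handful dominate,
# the safety gradient is effectively low-rank. Layer path assumes a Llama-style model.
out = model(**batch, labels=labels)
out.loss.backward()
grad = model.model.layers[0].self_attn.q_proj.weight.grad.float()
s = torch.linalg.svdvals(grad)
print("top-5 singular values:", s[:5].tolist())
print("fraction of energy in top-5:", (s[:5].pow(2).sum() / s.pow(2).sum()).item())
```

If the top few singular values carry most of the gradient energy, that is consistent with the abstract's explanation for why a single-example correction can realign the model so cheaply.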