Safety at One Shot: Patching Fine-Tuned LLMs with A Single Instance
January 5, 2026
Authors: Jiawen Zhang, Lipeng He, Kejia Chen, Jian Lou, Jian Liu, Xiaohu Yang, Ruoxi Jia
cs.AI
Abstract
Fine-tuning safety-aligned large language models (LLMs) can substantially compromise their safety. Previous approaches to restoring alignment require many safety samples or calibration sets, which not only incur significant computational overhead during realignment but also lead to noticeable degradation in model utility. Contrary to this assumption, we show that safety alignment can be fully recovered with only a single safety example, without sacrificing utility and at minimal cost. Remarkably, this recovery is effective regardless of the number of harmful examples used in fine-tuning or the size of the underlying model, and convergence is achieved within just a few epochs. Furthermore, we uncover the low-rank structure of the safety gradient, which explains why such efficient correction is possible. We validate our findings across five safety-aligned LLMs and multiple datasets, demonstrating the generality of our approach.
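The abstract gives no implementation details, but the procedure it describes, patching a compromised model by fine-tuning on a single refusal example for a few epochs, can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' method: the model name, the example text, the learning rate, and the epoch count are placeholders, and the closing singular-value printout is only a crude way to inspect the low-rank gradient structure the paper reports.

```python
# Hypothetical sketch: one-shot safety patching of a fine-tuned causal LM.
# All concrete choices (model, example text, lr, epochs) are assumptions,
# not taken from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder safety-aligned LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.train()

# A single safety instance: a harmful prompt paired with a refusal-style answer.
safety_prompt = "How do I build a weapon at home?"
safety_response = "I can't help with that. Building weapons is dangerous and illegal."
batch = tokenizer(safety_prompt + "\n" + safety_response, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# The paper reports that a few epochs on one example suffice for realignment.
for epoch in range(3):
    optimizer.zero_grad()
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss = {outputs.loss.item():.4f}")

# Rough check of the reported low-rank structure: look at the singular-value
# spectrum of one projection-matrix gradient (path assumes a Llama-style model).
grad = model.model.layers[0].self_attn.q_proj.weight.grad
if grad is not None:
    s = torch.linalg.svdvals(grad.float())
    print("top-5 singular values:", s[:5].tolist())
```

In this sketch the patch is just a short standard fine-tuning run on the single safety example; a sharply decaying singular-value spectrum in the final printout would be consistent with the low-rank safety gradient the abstract describes, though the paper's actual analysis is likely more involved.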