ZeroUnlearn: Few-Shot Knowledge Unlearning for Large Language Models

Abstract

Because they are trained on massive web corpora, large language models inevitably retain sensitive information, defined here as inputs that may induce harmful generations, which raises privacy and safety concerns. Existing machine unlearning methods primarily rely on retraining or aggressive fine-tuning, which are either computationally expensive or prone to degrading related knowledge and overall model utility. In this work, we reformulate machine unlearning as a precise knowledge re-mapping problem solved via model editing. We propose ZeroUnlearn, a few-shot unlearning framework that overwrites sensitive inputs by mapping them to a neutral target state and removing their original representations. ZeroUnlearn enforces representational orthogonality through a multiplicative parameter update with a closed-form solution, enabling efficient and targeted unlearning of sensitive information. We further extend ZeroUnlearn to a gradient-based variant that supports multi-sample unlearning. Experiments demonstrate that our approach outperforms existing baselines while preserving general model utility. Our code is available on GitHub: https://github.com/XMUDeepLIT/ZeroUnlearn.
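
To make the idea of a closed-form multiplicative update concrete, the following is a minimal NumPy sketch on a single linear layer. It is not the ZeroUnlearn algorithm, whose exact formulation the abstract does not give. It assumes one plausible construction: a left-multiplied matrix M such that the edited weights W' = M W send a sensitive input k to a neutral target that is orthogonal to the original representation W k. The function name `multiplicative_remap` and the variables `W`, `k`, `target`, and `k_other` are hypothetical and chosen only for illustration.

```python
import numpy as np


def multiplicative_remap(W, k, target):
    """Illustrative closed-form edit W' = M @ W (an assumption, not the paper's update).

    After the edit, W' @ k equals the part of `target` that is orthogonal
    to the original representation h = W @ k.
    """
    h = W @ k                           # original representation of the sensitive input
    hh = h @ h                          # squared norm of h
    v = target - (target @ h) / hh * h  # neutral target with its h-component removed
    # Rank-one correction: M @ h = h + (v - h) = v, and M leaves every
    # vector orthogonal to h unchanged.
    M = np.eye(W.shape[0]) + np.outer(v - h, h) / hh
    return M @ W


rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))        # toy layer weights
k = rng.normal(size=4)             # toy "sensitive" input
target = rng.normal(size=6)        # toy neutral target state

W_new = multiplicative_remap(W, k, target)
h_old = W @ k
h_new = W_new @ k

# Build an unrelated input whose original output is orthogonal to h_old.
g = W.T @ h_old
r = rng.normal(size=4)
k_other = r - (r @ g) / (g @ g) * g

print("overlap with original representation:", float(h_new @ h_old))
print("unrelated output preserved:", np.allclose(W_new @ k_other, W @ k_other))
```

In this toy setting the edited output for k has no component along its original representation, while an input whose output was already orthogonal to that representation is left unchanged. How the actual method selects layers, targets, and preserved knowledge is specified in the paper and repository, not here.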