RepSelect: Robust LLM Unlearning via Representation Selectivity

Abstract

Making large language models (LLMs) deeply forget specific knowledge and values without sacrificing general capabilities remains a central challenge in unlearning. Current methods are easily reversed by fine-tuning or few-shot prompting, which suggests that their forgetting is only shallow. We identify the root cause: existing methods target representations that are shared with both the retain set and the subspace a fine-tuning attacker can recover, which makes unlearning both disruptive to general capabilities and easy to reverse. We propose RepSelect (Representation Selectivity), which isolates forget-set-specific representations by collapsing the top principal components of the weight gradients before each update, leaving general capabilities intact while limiting what fine-tuning can recover. We evaluate on two forget categories, biohazardous knowledge and abusive tendencies, and on four model families spanning dense and Mixture-of-Experts architectures (Llama 3, Qwen 3.5, Gemma 4 E4B, DeepSeek V2 Lite). Compared with five popular baselines (GradDiff, NPO, SimNPO, RMU, UNDIAL), RepSelect achieves a 4 to 50 times larger reduction in post-relearning answer accuracy than the strongest baseline and is near-perfectly robust to few-shot prompting attacks. Targeting selective representations is thus an important step towards deep and robust LLM unlearning.
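
To make the phrase "collapsing top principal components of weight gradients" concrete, the following is a minimal illustrative sketch, not the RepSelect implementation. It assumes that the principal components are the leading singular directions of a single 2-D weight-gradient matrix and that "collapsing" means zeroing them. The abstract does not specify how the components are computed or how many are removed, so the function name `collapse_top_components` and the parameter `k` are hypothetical.

```python
import numpy as np


def collapse_top_components(grad, k):
    """Zero the k largest singular directions of a 2-D gradient matrix.

    Assumption: 'principal components' are taken to be the top singular
    directions of this one matrix; the actual method may define them differently.
    """
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    s = s.copy()
    s[:k] = 0.0          # singular values are returned in descending order
    return (u * s) @ vt  # rebuild the matrix from the remaining directions


# Toy example: a random stand-in for one layer's weight gradient.
rng = np.random.default_rng(0)
grad = rng.normal(size=(8, 6))
filtered = collapse_top_components(grad, k=2)
```

In a training loop, a filtered gradient of this kind would take the place of the raw gradient in the optimizer step. Under the assumption above, the removed directions stand in for the dominant components that the abstract describes as shared with the retain set.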