RepSelect: Robust LLM Unlearning Based on Representation Selectivity

Abstract

Making large language models (LLMs) deeply forget specific knowledge and values without sacrificing general capabilities remains a central challenge in unlearning. Existing methods, however, are easily reversed by fine-tuning or few-shot prompting, which suggests that the forgetting they achieve is only shallow. We identify the root cause: existing methods target representations that overlap both with the retain set and with the subspace that a fine-tuning attacker can recover, making unlearning both disruptive to general capabilities and easy to reverse. We propose RepSelect (Representation Selectivity), which isolates forget-set-specific representations by collapsing the top principal components of the weight gradients before each update, leaving general capabilities intact while limiting what fine-tuning can recover. We evaluate across two forget categories (biohazardous knowledge and abusive tendencies) and four model families spanning dense and Mixture-of-Experts architectures (Llama 3, Qwen 3.5, Gemma 4 E4B, DeepSeek V2 Lite). Compared to five popular baselines (GradDiff, NPO, SimNPO, RMU, UNDIAL), RepSelect achieves a 4-50x larger reduction in post-relearning answer accuracy than the strongest baseline and is near-perfectly robust to few-shot prompting attacks. Targeting selective representations is thus an important step towards deep and robust LLM forgetting.
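The gradient-filtering step described above can be illustrated with a small sketch. The abstract does not say how the principal components are computed or what "collapsing" means exactly, so this is only one possible reading and not the authors' implementation. It assumes that the gradient of a single weight matrix is treated as a 2-D array, that its leading principal directions are its top right singular vectors, and that collapsing means projecting those directions out completely. The function names (`top_right_singular_vector`, `collapse_top_components`), the parameter `k`, and the toy gradient are all hypothetical.

```python
import math
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def top_right_singular_vector(grad, iters=200, seed=0):
    """Approximate the leading right singular vector of grad by power
    iteration on grad^T grad. Returns a unit-length vector."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(len(grad[0]))]
    grad_t = [list(col) for col in zip(*grad)]
    for _ in range(iters):
        w = matvec(grad_t, matvec(grad, v))  # (G^T G) v
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # grad has no remaining energy
            break
        v = [x / norm for x in w]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def collapse_top_components(grad, k=1):
    """Assumed reading of "collapsing": remove the k leading principal
    directions from grad by deflation. The input is not modified."""
    out = [row[:] for row in grad]
    for _ in range(k):
        v = top_right_singular_vector(out)
        gv = matvec(out, v)  # coefficient of each row along v
        out = [[g - c * vj for g, vj in zip(row, v)]
               for row, c in zip(out, gv)]
    return out


if __name__ == "__main__":
    # Toy gradient: a dominant direction shared by every row (columns 0-1)
    # plus a small row-specific signal in column 2.
    grad = [[3.0, 3.0, 0.1],
            [3.0, 3.0, -0.1],
            [3.0, 3.0, 0.0]]
    filtered = collapse_top_components(grad, k=1)

    # Plain gradient step that uses the filtered gradient.
    lr = 0.1
    weights = [[0.0] * 3 for _ in range(3)]
    weights = [[w - lr * g for w, g in zip(w_row, g_row)]
               for w_row, g_row in zip(weights, filtered)]
    print(filtered)
    print(weights)
```

Power iteration is used here only to keep the sketch free of dependencies; with a tensor library, the same filtering would more naturally be written with its SVD routine. The choice of `k` and whether the components are removed fully or merely shrunk are left open by the abstract.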