ReLearn: Unlearning via Learning for Large Language Models
February 16, 2025
作者: Haoming Xu, Ningyuan Zhao, Liming Yang, Sendong Zhao, Shumin Deng, Mengru Wang, Bryan Hooi, Nay Oo, Huajun Chen, Ningyu Zhang
cs.AI
Abstract
Current unlearning methods for large language models usually rely on reverse
optimization to reduce target token probabilities. However, this paradigm
disrupts the prediction of subsequent tokens, degrading model performance and
linguistic coherence. Moreover, existing evaluation metrics overemphasize
contextual forgetting while inadequately assessing response fluency and
relevance. To address these challenges, we propose ReLearn, a data augmentation
and fine-tuning pipeline for effective unlearning, along with a comprehensive
evaluation framework. This framework introduces Knowledge Forgetting Rate (KFR)
and Knowledge Retention Rate (KRR) to measure forgetting and retention at the knowledge level, and
Linguistic Score (LS) to evaluate generation quality. Our experiments show that
ReLearn successfully achieves targeted forgetting while preserving high-quality
output. Through mechanistic analysis, we further demonstrate how reverse
optimization disrupts coherent text generation, while ReLearn preserves this
essential capability. Code is available at https://github.com/zjunlp/unlearn.
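
To make the contrast in the abstract concrete, below is a minimal, hypothetical sketch (not the authors' released implementation; see the repository above for that). It assumes a Hugging Face-style causal LM whose forward pass returns a cross-entropy `loss` when `labels` are provided, and contrasts a reverse-optimization objective, which performs gradient ascent on forget-set tokens, with a ReLearn-style objective that simply fine-tunes on augmented replacement data using the ordinary language-modeling loss.

```python
# Minimal, hypothetical sketch (not the authors' released code). It assumes a
# Hugging Face-style causal LM whose forward pass returns a cross-entropy
# `loss` when `labels` are provided.

def reverse_optimization_loss(model, input_ids, labels):
    """Gradient-ascent-style unlearning: negate the NLL on the forget set.

    Minimizing this pushes down the probability of the targeted tokens, the
    paradigm the abstract argues disrupts prediction of subsequent tokens.
    """
    out = model(input_ids=input_ids, labels=labels)
    return -out.loss


def relearn_style_loss(model, aug_input_ids, aug_labels):
    """Unlearning via learning: ordinary fine-tuning NLL on augmented data.

    `aug_input_ids` / `aug_labels` stand for augmented question-answer pairs
    whose answers no longer expose the targeted knowledge, so the model keeps
    generating coherent text while the sensitive content is overwritten.
    """
    out = model(input_ids=aug_input_ids, labels=aug_labels)
    return out.loss


# Example training step (names are illustrative):
#   loss = relearn_style_loss(model, batch["input_ids"], batch["labels"])
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```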