ChatPaper.ai

LLM Unlearning Should Be Form-Independent

June 9, 2025
Authors: Xiaotian Ye, Mengqi Zhang, Shu Wu
cs.AI

Abstract

Large Language Model (LLM) unlearning aims to erase or suppress undesirable knowledge within the model, offering promise for controlling harmful or private information to prevent misuse. However, recent studies highlight its limited efficacy in real-world scenarios, hindering practical adoption. In this study, we identify a pervasive issue underlying many downstream failures: the effectiveness of existing unlearning methods heavily depends on the form of training samples and frequently fails to generalize to alternate expressions of the same knowledge. We formally characterize this problem as Form-Dependent Bias and systematically investigate its specific manifestation patterns across various downstream tasks. To quantify its prevalence and support future research, we introduce ORT, a novel benchmark designed to evaluate the robustness of unlearning methods against variations in knowledge expression. Results reveal that Form-Dependent Bias is both widespread and severe among current techniques. We argue that LLM unlearning should be form-independent to address the endless forms of downstream tasks encountered in real-world security-critical scenarios. Towards this goal, we introduce Rank-one Concept Redirection (ROCR), a novel training-free method, as a promising solution path. ROCR performs unlearning by targeting the invariants in downstream tasks, specifically the activated dangerous concepts. It is capable of modifying model parameters within seconds to redirect the model's perception of a specific unlearning target concept to another harmless concept. Extensive experiments demonstrate that ROCR significantly improves unlearning effectiveness compared to traditional methods while generating highly natural outputs.
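The core idea behind Rank-one Concept Redirection — editing a weight matrix so that the representation of the unlearning target is mapped to that of a harmless concept — can be illustrated with a simple linear-algebra sketch. The sketch below is a generic rank-one edit under assumed notation (`W` a layer's weight matrix, `k_target` and `k_harmless` hypothetical concept key vectors); the paper's actual objective and choice of layer may differ.

```python
import numpy as np

def rank_one_redirect(W, k_target, k_harmless):
    """Rank-one edit: make W respond to the target concept's key exactly as it
    responds to the harmless concept's key, leaving all directions orthogonal
    to k_target untouched. A sketch of the general redirection idea, not the
    paper's exact formulation."""
    v_new = W @ k_harmless  # desired output for the target key
    # Rank-one correction: (v_new - W k_target) k_target^T / ||k_target||^2
    delta = np.outer(v_new - W @ k_target, k_target) / (k_target @ k_target)
    return W + delta  # now (W + delta) @ k_target == W @ k_harmless
```

Because the update is rank-one and closed-form, it can be applied in seconds without any gradient-based training, which matches the training-free property the abstract emphasizes.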