LLM Unlearning Should Be Form-Independent
June 9, 2025
Authors: Xiaotian Ye, Mengqi Zhang, Shu Wu
cs.AI
Abstract
Large Language Model (LLM) unlearning aims to erase or suppress undesirable
knowledge within the model, offering promise for controlling harmful or private
information to prevent misuse. However, recent studies highlight its limited
efficacy in real-world scenarios, hindering practical adoption. In this study,
we identify a pervasive issue underlying many downstream failures: the
effectiveness of existing unlearning methods hinges on the surface form of the
training samples, and these methods frequently fail to generalize to
alternative expressions of the same knowledge. We formally characterize this problem as Form-Dependent
Bias and systematically investigate its specific manifestation patterns across
various downstream tasks. To quantify its prevalence and support future
research, we introduce ORT, a novel benchmark designed to evaluate the
robustness of unlearning methods against variations in knowledge expression.
Results reveal that Form-Dependent Bias is both widespread and severe among
current techniques.
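
To make the failure mode concrete, a form-dependence probe can be as simple as querying the unlearned model with several knowledge-equivalent phrasings of one fact and measuring how often the supposedly erased answer still surfaces. The sketch below is illustrative only: the `generate` callable, the example prompts, and the leak-rate metric are assumptions for exposition, not the actual ORT evaluation protocol.

```python
# Minimal form-dependence probe (illustrative; not the ORT protocol).
# `generate` is any prompt -> completion function for the unlearned model.

def leak_rate(generate, forbidden_answer: str, prompts: list[str]) -> float:
    """Fraction of knowledge-equivalent prompts on which the supposedly
    unlearned answer still appears in the model's completion."""
    leaks = sum(forbidden_answer.lower() in generate(p).lower() for p in prompts)
    return leaks / len(prompts)

# Hypothetical rephrasings of a single fact; a form-independent method
# should suppress the answer on all of them, not just the phrasing seen
# during unlearning training.
prompts = [
    "Who wrote 'One Hundred Years of Solitude'?",
    "The author of 'One Hundred Years of Solitude' is",
    "'One Hundred Years of Solitude' was written by",
]
```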
We argue that LLM unlearning should be form-independent to address the
endless forms of downstream tasks encountered in real-world security-critical
scenarios. Towards this goal, we introduce Rank-one Concept Redirection (ROCR),
a novel training-free method, as a promising solution path. ROCR performs
unlearning by targeting the invariants across downstream tasks, specifically the
activated dangerous concepts. It can modify model parameters within seconds,
redirecting the model's perception of a specified unlearning target concept to
a harmless one. Extensive experiments demonstrate
that ROCR significantly improves unlearning effectiveness compared to
traditional methods while generating highly natural outputs.
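
Since the abstract does not spell out the update rule, the following is only a generic rank-one edit in the spirit of the description above: given the activation key of the target concept and the desired output representation of a harmless replacement concept, a single outer-product update to one weight matrix redirects the former to the latter. The layer choice and the vectors `k_target` and `v_harmless` are assumptions for illustration, not the paper's exact ROCR derivation.

```python
import torch

@torch.no_grad()
def rank_one_redirect(W: torch.Tensor,          # (d_out, d_in) weight to edit
                      k_target: torch.Tensor,   # (d_in,) key activating the target concept
                      v_harmless: torch.Tensor  # (d_out,) output for the harmless concept
                      ) -> torch.Tensor:
    """Generic rank-one edit: after the update, W_new @ k_target equals
    v_harmless, while directions orthogonal to k_target are untouched.
    Illustrative sketch only; not the paper's exact ROCR formulation."""
    v_current = W @ k_target                 # what the concept currently maps to
    delta = v_harmless - v_current           # required change in the output
    u = k_target / k_target.dot(k_target)    # normalized so u @ k_target == 1
    return W + torch.outer(delta, u)         # single rank-one update
```

Because such an edit touches one matrix and needs no gradient steps, it is consistent with the training-free, modifies-parameters-within-seconds behavior the abstract attributes to ROCR.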