Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces
June 17, 2024
Authors: Yihuai Hong, Lei Yu, Shauli Ravfogel, Haiqin Yang, Mor Geva
cs.AI
Abstract
The task of "unlearning" certain concepts in large language models (LLMs) has
attracted immense attention recently, due to its importance for mitigating
undesirable model behaviors, such as the generation of harmful, private, or
incorrect information. Current protocols to evaluate unlearning methods largely
rely on behavioral tests, without monitoring the presence of unlearned
knowledge within the model's parameters. This residual knowledge can be
adversarially exploited to recover the erased information post-unlearning. We
argue that unlearning should also be evaluated internally, by considering
changes in the parametric knowledge traces of the unlearned concepts. To this
end, we propose a general methodology for eliciting directions in the parameter
space (termed "concept vectors") that encode concrete concepts, and construct
ConceptVectors, a benchmark dataset containing hundreds of common concepts and
their parametric knowledge traces within two open-source LLMs. Evaluation on
ConceptVectors shows that existing unlearning methods minimally impact concept
vectors, while directly ablating these vectors demonstrably removes the
associated knowledge from the LLMs and significantly reduces their
susceptibility to adversarial manipulation. Our results highlight limitations
in behavior-based unlearning evaluations and call for future work to include
parameter-based evaluations. To support this, we release our code and
benchmark at https://github.com/yihuaihong/ConceptVectors.
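
The idea of evaluating unlearning at the parameter level can be illustrated with a toy sketch. The abstract does not say how a concept vector is stored or compared, so everything here is an assumption made for illustration: a concept vector is treated as a plain list of floats, the change in its trace is measured with cosine similarity and L2 distance, and ablation is simulated by adding Gaussian noise. The names `trace_change` and `add_noise` are hypothetical, and the numbers are synthetic rather than results from the paper.

```python
import math
import random


def trace_change(before, after):
    """Compare one parameter vector before and after an edit.

    Assumes both vectors are non-zero and have the same length.
    """
    dot = sum(b * a for b, a in zip(before, after))
    norm_b = math.sqrt(sum(b * b for b in before))
    norm_a = math.sqrt(sum(a * a for a in after))
    l2 = math.sqrt(sum((b - a) ** 2 for b, a in zip(before, after)))
    return {"cosine": dot / (norm_b * norm_a), "l2": l2}


def add_noise(vector, noise_std, rng):
    """Return a copy of `vector` with independent Gaussian noise added."""
    return [v + rng.gauss(0.0, noise_std) for v in vector]


rng = random.Random(0)

# Hypothetical concept vector: a toy stand-in, not taken from a real LLM.
concept_vector = [rng.gauss(0.0, 1.0) for _ in range(16)]

# Assumption: an unlearning method that barely moves the parameters.
after_unlearning = add_noise(concept_vector, 1e-3, rng)

# Assumption: direct ablation modeled as large Gaussian noise on the vector.
after_ablation = add_noise(concept_vector, 5.0, rng)

small = trace_change(concept_vector, after_unlearning)
large = trace_change(concept_vector, after_ablation)

print("after unlearning:", small)
print("after ablation:  ", large)
```

In this sketch a cosine similarity near 1 and a small L2 distance would indicate that the trace is essentially intact, which is the kind of signal a purely behavioral test cannot report.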