Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces

Abstract

The task of "unlearning" certain concepts in large language models (LLMs) has recently attracted considerable attention because of its importance for mitigating undesirable model behaviors, such as the generation of harmful, private, or incorrect information. Current protocols for evaluating unlearning methods rely largely on behavioral tests and do not monitor whether the supposedly unlearned knowledge remains in the model's parameters. Such residual knowledge can be adversarially exploited to recover the erased information after unlearning. We argue that unlearning should also be evaluated internally, by considering changes in the parametric knowledge traces of the unlearned concepts. To this end, we propose a general methodology for eliciting directions in the parameter space (termed "concept vectors") that encode concrete concepts, and we construct ConceptVectors, a benchmark dataset containing hundreds of common concepts and their parametric knowledge traces within two open-source LLMs. Evaluation on ConceptVectors shows that existing unlearning methods have minimal impact on concept vectors, whereas directly ablating these vectors demonstrably removes the associated knowledge from the LLMs and significantly reduces their susceptibility to adversarial manipulation. Our results highlight the limitations of behavior-based unlearning evaluations and call for future work to include parameter-based evaluations. To support this, we release our code and benchmark at https://github.com/yihuaihong/ConceptVectors.
