Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces

Abstract

The task of "unlearning" certain concepts in large language models (LLMs) has recently attracted considerable attention because of its importance for mitigating undesirable model behaviors, such as the generation of harmful, private, or incorrect information. Current protocols for evaluating unlearning methods rely largely on behavioral tests and do not monitor whether the unlearned knowledge is still present in the model's parameters. This residual knowledge can be adversarially exploited to recover the erased information after unlearning. We argue that unlearning should also be evaluated internally, by considering changes in the parametric knowledge traces of the unlearned concepts. To this end, we propose a general methodology for eliciting directions in the parameter space (termed "concept vectors") that encode concrete concepts, and we construct ConceptVectors, a benchmark dataset containing hundreds of common concepts and their parametric knowledge traces within two open-source LLMs. Evaluation on ConceptVectors shows that existing unlearning methods have minimal impact on concept vectors, whereas directly ablating these vectors demonstrably removes the associated knowledge from the LLMs and significantly reduces their susceptibility to adversarial manipulation. Our results highlight the limitations of behavior-based unlearning evaluations and call for future work to include parameter-based evaluations. To support this, we release our code and benchmark at https://github.com/yihuaihong/ConceptVectors.
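To make the idea of a parameter-based check concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that a concept vector can be read out as a plain list of floats before and after unlearning, that cosine similarity and L2 distance are reasonable measures of how much it changed, and that ablation can be approximated by zeroing the vector. The function names (`cosine_similarity`, `l2_distance`, `ablate`, `trace_change`) and the toy numbers are hypothetical and do not come from the ConceptVectors code.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def ablate(vector):
    """Illustrative ablation (an assumption): replace the vector with zeros."""
    return [0.0] * len(vector)


def trace_change(before, after):
    """Summarize how much a parametric knowledge trace moved between two checkpoints."""
    return {
        "cosine": cosine_similarity(before, after),
        "l2": l2_distance(before, after),
    }


# Toy values, invented for illustration only.
concept_before = [0.5, -1.0, 2.0, 0.25]
after_unlearning = [0.5, -0.9, 2.0, 0.25]   # parameters barely moved
after_ablation = ablate(concept_before)      # trace removed directly

mild = trace_change(concept_before, after_unlearning)
strong = trace_change(concept_before, after_ablation)
print("after unlearning:", mild)
print("after ablation:  ", strong)
```

In this toy setting, a cosine similarity close to 1 after unlearning would indicate that the trace is largely intact even if behavioral tests pass, which is the kind of gap the abstract argues behavioral evaluation can miss.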
