Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces

Abstract

The task of "unlearning" certain concepts in large language models (LLMs) has recently attracted considerable attention because of its importance for mitigating undesirable model behaviors, such as the generation of harmful, private, or incorrect information. Current protocols for evaluating unlearning methods rely largely on behavioral tests and do not monitor whether the unlearned knowledge is still present in the model's parameters. Such residual knowledge can be adversarially exploited to recover the erased information after unlearning. We argue that unlearning should also be evaluated internally, by considering changes in the parametric knowledge traces of the unlearned concepts. To this end, we propose a general methodology for eliciting directions in the parameter space (termed "concept vectors") that encode concrete concepts, and construct ConceptVectors, a benchmark dataset containing hundreds of common concepts and their parametric knowledge traces within two open-source LLMs. Evaluation on ConceptVectors shows that existing unlearning methods have minimal impact on concept vectors, whereas directly ablating these vectors demonstrably removes the associated knowledge from the LLMs and significantly reduces their susceptibility to adversarial manipulation. Our results highlight the limitations of behavior-based unlearning evaluations and call for future work to include parameter-based evaluations. To support this, we release our code and benchmark at https://github.com/yihuaihong/ConceptVectors.
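As a rough illustration of what a parameter-based check might look like, the sketch below compares a single concept vector before and after an intervention. It is a minimal sketch under stated assumptions, not the benchmark's actual protocol: a concept vector is treated as a plain list of floats, cosine similarity and L2 distance are assumed as the change measures, ablation is approximated by zeroing the vector, and the function names and toy values are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def trace_change(before, after):
    """Summarize how much a concept vector moved (assumed metrics, for illustration)."""
    return {
        "cosine": cosine_similarity(before, after),
        "l2": l2_distance(before, after),
    }


def ablate(vector):
    """Hypothetical ablation: replace the vector with zeros."""
    return [0.0 for _ in vector]


# Toy values only; real concept vectors would be read from model weights.
original = [0.5, -1.0, 2.0, 0.25]
after_unlearning = [0.5, -0.9, 2.0, 0.25]  # barely changed
after_ablation = ablate(original)          # trace removed

print(trace_change(original, after_unlearning))
print(trace_change(original, after_ablation))
```

In this toy setup, a cosine similarity close to 1 and a small L2 distance would suggest that the knowledge trace is largely intact even if the model's outputs have changed, which is the kind of gap a purely behavioral test would not reveal.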
