Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces

Abstract

The task of "unlearning" certain concepts in large language models (LLMs) has recently attracted immense attention, due to its importance for mitigating undesirable model behaviors, such as the generation of harmful, private, or incorrect information. Current protocols for evaluating unlearning methods rely largely on behavioral tests, without monitoring whether the unlearned knowledge is still present in the model's parameters. This residual knowledge can be adversarially exploited to recover the erased information after unlearning. We argue that unlearning should also be evaluated internally, by considering changes in the parametric knowledge traces of the unlearned concepts. To this end, we propose a general methodology for eliciting directions in the parameter space (termed "concept vectors") that encode concrete concepts, and construct ConceptVectors, a benchmark dataset containing hundreds of common concepts and their parametric knowledge traces within two open-source LLMs. Evaluation on ConceptVectors shows that existing unlearning methods minimally impact concept vectors, while directly ablating these vectors demonstrably removes the associated knowledge from the LLMs and significantly reduces their susceptibility to adversarial manipulation. Our results highlight the limitations of behavior-based unlearning evaluations and call for future work to include parameter-based evaluations. To support this, we release our code and benchmark at https://github.com/yihuaihong/ConceptVectors.
