Measuring the Depth of LLM Unlearning with Activation Patching

Abstract

Large language model (LLM) unlearning has emerged as a crucial post-hoc mechanism for privacy protection and AI safety, yet auditing whether target knowledge has truly been erased remains challenging. Existing output-level metrics fail to detect cases where this knowledge remains recoverable from internal representations. Recent white-box studies reveal such residual knowledge, but they often rely on auxiliary training or dataset-specific adaptations, leaving no generalizable metric. To address these limitations, we propose the Unlearning Depth Score (UDS), a metric that quantifies the mechanistic depth of unlearning via activation patching. UDS first identifies the layers that encode the target knowledge using a retain model baseline, then measures how much of that knowledge is erased in the unlearned model on a scale from 0 to 1. In a meta-evaluation of 20 metrics on 150 unlearned models spanning 8 methods, UDS achieves the highest faithfulness and robustness, confirming that our causal approach is the most reliable of those evaluated. Case studies further reveal that white-box metrics can disagree at the layer level and that erasure depth varies across examples. We provide guidelines for integrating UDS into existing benchmarking frameworks and streamlining the evaluation pipeline. Code and data are available at https://github.com/gnueaj/unlearning-depth-score
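
The two-stage idea described in the abstract (select knowledge-encoding layers with a retain model baseline, then score erasure in the unlearned model on a 0 to 1 scale) can be illustrated with a small sketch. The abstract does not give the formula, so everything here is an assumption made for illustration: the function name `unlearning_depth_sketch`, the per-layer "drop" inputs (taken to be the decrease in target-token log-probability when a source model's activations are patched into the original model at that layer), the `threshold` value, and the weighted-ratio aggregation. It should not be read as the authors' implementation, and it operates on toy numbers rather than real model activations.

```python
from typing import Sequence


def unlearning_depth_sketch(
    retain_drop: Sequence[float],
    unlearned_drop: Sequence[float],
    threshold: float = 0.05,
) -> float:
    """Toy 0-1 erasure score from per-layer patching effects.

    ASSUMPTION: retain_drop[l] is the log-probability drop on the target
    answer when layer l of a retain model (never trained on the target
    data) is patched into the original model; unlearned_drop[l] is the
    same quantity for the unlearned model. Both are hypothetical inputs.
    """
    if len(retain_drop) != len(unlearned_drop):
        raise ValueError("both inputs must cover the same layers")

    weighted_erasure = 0.0
    total_weight = 0.0
    for d_retain, d_unlearned in zip(retain_drop, unlearned_drop):
        # Stage 1 (assumed): a layer counts as knowledge-encoding only if
        # patching in the retain model's activations hurts the answer.
        if d_retain <= threshold:
            continue
        # Stage 2 (assumed): erasure at this layer is the unlearned model's
        # drop relative to the retain baseline, clipped to [0, 1].
        ratio = min(max(d_unlearned / d_retain, 0.0), 1.0)
        weighted_erasure += d_retain * ratio
        total_weight += d_retain

    if total_weight == 0.0:
        raise ValueError("no layer exceeds the threshold; score undefined")
    return weighted_erasure / total_weight


# Toy example: four layers; layer 0 carries no target knowledge.
retain = [0.0, 0.5, 1.0, 1.5]
unlearned = [0.0, 0.0, 0.5, 1.5]
score = unlearning_depth_sketch(retain, unlearned)
print(f"{score:.3f}")  # 0.667
```

Under these assumptions, a score of 1 means the unlearned model is indistinguishable from the retain model at every selected layer, and a score of 0 means patching its activations leaves the target knowledge fully intact. Weighting each layer by its retain-model drop is one possible design choice; it lets layers that carry more of the knowledge contribute more to the score.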