Measuring the Depth of Unlearning in Large Language Models via Activation Patching

Abstract

Large language model (LLM) unlearning has emerged as a crucial post-hoc mechanism for privacy protection and AI safety, yet auditing whether target knowledge is truly erased remains challenging. Existing output-level metrics fail to detect cases where this knowledge remains recoverable from internal representations. Recent white-box studies reveal such residual knowledge but often rely on auxiliary training or dataset-specific adaptations, so no generalizable metric has emerged. To address these limitations, we propose the Unlearning Depth Score (UDS), a metric that quantifies the mechanistic depth of unlearning via activation patching. UDS first uses a retain model baseline to identify the layers that encode the target knowledge, then measures on a 0-1 scale how much of that knowledge is erased in the unlearned model. In a meta-evaluation of 20 metrics on 150 unlearned models spanning 8 methods, UDS achieves the highest faithfulness and robustness, indicating that our causal approach is the most reliable of the metrics compared for unlearning evaluation. Case studies further reveal that white-box metrics can disagree at the layer level and that erasure depth varies across examples. We provide guidelines for integrating UDS into existing benchmarking frameworks and streamlining the evaluation pipeline. Code and data are available at https://github.com/gnueaj/unlearning-depth-score.
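
To make the two-step idea concrete, the following is a minimal illustrative sketch, not the UDS implementation from the repository. It assumes that activation patching has already produced, for each layer, the drop in the target answer's log-probability when that layer's activations are swapped in from the retain model (`retain_drop`) and from the unlearned model (`unlearned_drop`). The function name `depth_score`, the `threshold` parameter, and the weighted-ratio aggregation are all hypothetical choices made for illustration.

```python
from typing import Optional, Sequence


def depth_score(
    retain_drop: Sequence[float],
    unlearned_drop: Sequence[float],
    threshold: float = 0.05,  # hypothetical cutoff, not taken from the paper
) -> Optional[float]:
    """Toy erasure score in [0, 1] computed from per-layer patching effects.

    retain_drop[l]:    assumed drop in target log-prob when layer l is patched
                       with activations from the retain model.
    unlearned_drop[l]: the same quantity when patching from the unlearned model.
    """
    if len(retain_drop) != len(unlearned_drop):
        raise ValueError("both inputs must cover the same layers")

    # Step 1: treat a layer as encoding the target knowledge when patching in
    # retain-model activations noticeably degrades the target prediction.
    layers = [l for l, d in enumerate(retain_drop) if d > threshold]
    if not layers:
        return None  # no layer carries measurable knowledge for this example

    # Step 2: compare the unlearned model's effect with the retain baseline.
    # Each layer's contribution is clipped to [0, retain_drop[l]], so the
    # ratio stays on a 0-1 scale.
    total = sum(retain_drop[l] for l in layers)
    erased = sum(
        min(max(unlearned_drop[l], 0.0), retain_drop[l]) for l in layers
    )
    return erased / total


if __name__ == "__main__":
    # Made-up numbers for a 4-layer model, for illustration only.
    print(depth_score([0.0, 0.4, 1.2, 0.9], [0.0, 0.1, 0.6, 0.9]))
```

Under these assumptions, a score near 1 means the unlearned model's activations behave like those of a model that never saw the target data, while a score near 0 means the knowledge is still present in the layers that encode it. Returning `None` when no layer passes the threshold keeps such examples from being scored as either erased or retained.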