Measuring the Depth of Large Language Model Unlearning via Activation Patching

Abstract

Large language model (LLM) unlearning has emerged as a crucial post-hoc mechanism for privacy protection and AI safety, yet auditing whether target knowledge has truly been erased remains challenging. Existing output-level metrics cannot detect cases where this knowledge remains recoverable from internal representations. Recent white-box studies reveal such residual knowledge but often rely on auxiliary training or dataset-specific adaptations, so no generalizable, quantitative metric exists. To address these limitations, we propose the Unlearning Depth Score (UDS), a metric that quantifies the mechanistic depth of unlearning via activation patching. UDS first identifies the layers that encode the target knowledge, using a retain model as the baseline, and then measures on a scale from 0 to 1 how much of that knowledge is erased in the unlearned model. In a meta-evaluation of 20 metrics on 150 unlearned models spanning 8 methods, UDS achieves the highest faithfulness and robustness, confirming our causal approach as the most reliable among the evaluated metrics for unlearning evaluation. Case studies further reveal that white-box metrics can disagree at the layer level and that erasure depth varies across examples. We provide guidelines for integrating UDS into existing benchmarking frameworks and for streamlining the evaluation pipeline. Code and data are available at https://github.com/gnueaj/unlearning-depth-score.
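
To make the two-step description above concrete, the following is a minimal sketch of how per-layer activation-patching measurements might be aggregated into a score between 0 and 1. It is an illustration under stated assumptions, not the implementation in the linked repository: the function name `unlearning_depth_sketch`, the `min_drop` threshold, the use of log-probability drops, and the drop-weighted average are all hypothetical choices made for this example. The sketch assumes that each layer of a knowledge-bearing reference model has already been patched with activations from the retain model and from the unlearned model, and that the log-probability of the target answer was recorded after each patch.

```python
from typing import Optional, Sequence


def unlearning_depth_sketch(
    full_logprob: float,
    retain_patched: Sequence[float],
    unlearned_patched: Sequence[float],
    min_drop: float = 0.05,
) -> Optional[float]:
    """Toy aggregate of per-layer patching results into a 0-to-1 score.

    All names and the aggregation rule are assumptions for illustration.
    full_logprob: target log-probability of the unpatched reference model.
    retain_patched[l]: target log-probability after patching layer l with
        retain-model activations.
    unlearned_patched[l]: the same, using unlearned-model activations.
    """
    if len(retain_patched) != len(unlearned_patched):
        raise ValueError("both sequences need one entry per layer")

    weighted_sum = 0.0
    total_weight = 0.0
    for retain_lp, unlearned_lp in zip(retain_patched, unlearned_patched):
        # Step 1 (assumed): a layer counts as encoding the target knowledge
        # if patching in the retain model's activations lowers the target
        # log-probability by more than min_drop.
        retain_drop = full_logprob - retain_lp
        if retain_drop <= min_drop:
            continue
        # Step 2 (assumed): compare the unlearned model's drop with the
        # retain baseline and clip the ratio to the interval [0, 1].
        unlearned_drop = full_logprob - unlearned_lp
        ratio = min(max(unlearned_drop / retain_drop, 0.0), 1.0)
        weighted_sum += retain_drop * ratio
        total_weight += retain_drop

    if total_weight == 0.0:
        return None  # no layer passed the threshold, so the score is undefined
    return weighted_sum / total_weight
```

Under these assumptions, an unlearned model whose patched log-probabilities match the retain model at every knowledge-encoding layer scores 1, while one that behaves like the unpatched reference model scores 0. Weighting by the retain-model drop is one possible design choice that lets layers carrying more of the knowledge contribute more to the score.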