Measuring the Depth of LLM Unlearning via Activation Patching

Abstract

Large language model (LLM) unlearning has emerged as a crucial post-hoc mechanism for privacy protection and AI safety, yet auditing whether target knowledge has truly been erased remains challenging. Existing output-level metrics fail to detect cases where this knowledge remains recoverable from internal representations. Recent white-box studies reveal such residual knowledge but often rely on auxiliary training or dataset-specific adaptations, and so do not yield a generalizable metric. To address these limitations, we propose the Unlearning Depth Score (UDS), a metric that quantifies the mechanistic depth of unlearning via activation patching. UDS first identifies the layers that encode the target knowledge using a retain model baseline, then measures how much of that knowledge is erased in the unlearned model on a 0 to 1 scale. In a meta-evaluation of 20 metrics on 150 unlearned models spanning 8 methods, UDS achieves the highest faithfulness and robustness, confirming our causal approach as the most reliable for unlearning evaluation. Case studies further reveal that white-box metrics can disagree at the layer level and that erasure depth varies across examples. We provide guidelines for integrating UDS into existing benchmarking frameworks and streamlining the evaluation pipeline. Code and data are available at https://github.com/gnueaj/unlearning-depth-score
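
To make the two-step description concrete, the following is a minimal sketch of how per-layer patching effects could be aggregated into a 0 to 1 score. It is an illustration under stated assumptions, not the formula from the paper: the function name `depth_score`, the inputs `retain_drop` and `unlearned_drop`, the `threshold` parameter, and the clipped-ratio weighting are all hypothetical. The sketch assumes that each input holds, for every layer, the drop in the target answer's log-probability observed when that layer's activations in a model that knows the target are replaced by activations from the retain model or from the unlearned model.

```python
from typing import Sequence


def depth_score(
    retain_drop: Sequence[float],
    unlearned_drop: Sequence[float],
    threshold: float = 0.05,
) -> float:
    """Hypothetical 0-to-1 erasure score built from per-layer patching effects.

    retain_drop[l]:    drop in the target log-probability when layer l is
                       patched with retain-model activations (assumed input).
    unlearned_drop[l]: the same quantity when layer l is patched with
                       unlearned-model activations (assumed input).
    """
    if len(retain_drop) != len(unlearned_drop):
        raise ValueError("both inputs need exactly one value per layer")

    # Step 1: treat a layer as encoding the target knowledge when patching in
    # retain-model activations (which lack that knowledge) hurts the output.
    selected = [l for l, drop in enumerate(retain_drop) if drop > threshold]
    if not selected:
        raise ValueError("no layer exceeds the threshold; score is undefined")

    # Step 2: on those layers, compare the unlearned model's effect with the
    # retain baseline. A ratio of 1 means the layer looks as erased as in the
    # retain model; a ratio of 0 means the knowledge is still fully present.
    total = sum(retain_drop[l] for l in selected)
    score = 0.0
    for l in selected:
        ratio = unlearned_drop[l] / retain_drop[l]
        ratio = min(1.0, max(0.0, ratio))  # clip so the score stays in [0, 1]
        score += (retain_drop[l] / total) * ratio  # weight by baseline effect
    return score


# Made-up numbers for a 3-layer model, for illustration only (not results).
retain = [0.0, 0.5, 1.5]
unlearned = [0.0, 0.5, 0.0]  # erased at layer 1, untouched at layer 2
print(depth_score(retain, unlearned))  # 0.25
```

In this toy case the unlearned model matches the retain baseline only at the layer that carries a quarter of the baseline effect, so the score is 0.25. Weighting by the baseline effect is one possible design choice: it lets layers that matter more for the target knowledge dominate the score.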