Forgetting That Sticks: Quantization-Permanent Unlearning via Circuit Attribution

Abstract

Standard unlearning evaluations measure behavioral suppression in full precision, immediately after training, even though every deployed language model is quantized first. Recent work has shown that 4-bit post-training quantization (PTQ) can reverse machine unlearning. We show that this is not a tuning artefact but a systematic dual failure: gradient-based methods that achieve meaningful forgetting lose it under compression, while methods that survive quantization barely change the model. Both failures trace to the same root cause. Across all baselines, per-parameter updates are 47 to 828 times smaller than the NF4 quantization bin width, and updates diffused across billions of parameters cannot cross quantization bin boundaries, a consequence we formalize as a sparsity-permanence tradeoff. We present MANSU (Mechanistic-Aligned Null-Space Unlearning), which resolves both failure modes by combining three components: causal circuit attribution to isolate the minimal forget-set subgraph, circuit-restricted null-space projection with a diagonal-Fisher retain bound, and a per-parameter magnitude floor that guarantees quantization survival by construction. We additionally introduce Circuit Attribution Divergence (CAD), a mechanistic verification metric that distinguishes structural erasure from behavioral suppression, a distinction existing metrics cannot make. Across multiple model families and hazard benchmarks, MANSU is the first method to jointly satisfy all four properties (meaningful forgetting, retain preservation, non-positive PTQ gap, and structural erasure) with margin on each, while gradient-based baselines recover up to +0.05 accuracy under compression.
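The bin-width argument can be illustrated with a small numerical sketch. It is not the MANSU procedure, only a toy under stated assumptions: the 16 code points are the NF4 levels as commonly listed for QLoRA-style quantization, the block size of 64 and the Gaussian weights are arbitrary choices, and the "floor" used for the sparse update (slightly more than the widest level spacing) is a hypothetical stand-in for the per-parameter magnitude floor described above. The sketch compares how many 4-bit codes change under a diffuse update 100 times smaller than the mean level spacing versus a sparse update that clears the floor.

```python
import numpy as np

# 16 NF4 code points as commonly listed for QLoRA-style NF4 (assumed here).
NF4_LEVELS = np.array([
    -1.0, -0.6961928, -0.5250731, -0.3949175, -0.2844414, -0.1847734,
    -0.0910500, 0.0, 0.0795803, 0.1609302, 0.2461123, 0.3379152,
    0.4407098, 0.5626170, 0.7229568, 1.0,
])
BLOCK = 64  # assumed block size for absmax scaling


def block_scales(w):
    """Per-block absmax scale, repeated so it has the same shape as w."""
    blocks = np.abs(w.reshape(-1, BLOCK))
    return np.repeat(blocks.max(axis=1), BLOCK)


def nf4_codes(w):
    """Index of the nearest NF4 level after blockwise absmax scaling."""
    normed = w / block_scales(w)
    return np.abs(normed[:, None] - NF4_LEVELS[None, :]).argmin(axis=1)


def changed_fraction(w, delta, mask=None):
    """Fraction of (optionally masked) weights whose 4-bit code changes."""
    changed = nf4_codes(w) != nf4_codes(w + delta)
    if mask is not None:
        changed = changed[mask]
    return float(changed.mean())


rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=BLOCK * 512)  # toy weights, not a real model
scale = block_scales(w)
gaps = np.diff(NF4_LEVELS)  # spacing between adjacent levels

# Diffuse update: every weight moves by 1/100 of the mean level spacing.
diffuse = rng.choice([-1.0, 1.0], size=w.size) * scale * gaps.mean() / 100.0

# Sparse update: about 1% of weights move toward zero by slightly more
# than the widest level spacing (hypothetical floor, for illustration only).
mask = rng.random(w.size) < 0.01
sparse = np.where(mask, -np.sign(w) * scale * gaps.max() * 1.01, 0.0)

diffuse_frac = changed_fraction(w, diffuse)
sparse_frac = changed_fraction(w, sparse, mask)
print(f"diffuse update: {diffuse_frac:.2%} of all codes change")
print(f"sparse update:  {sparse_frac:.2%} of targeted codes change")
```

In this toy setting the diffuse update leaves nearly all codes untouched, so the quantized model is almost identical to the quantized original, while the sparse update changes the code of almost every weight it targets. Scales are recomputed after each update, so a few codes can still change or stay put through a shifted block maximum.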