Durable Forgetting: Quantization-Permanent Unlearning via Circuit Attribution

Abstract

Standard unlearning evaluations measure behavioral suppression in full precision, immediately after training, even though every deployed language model is quantized first. Recent work has shown that 4-bit post-training quantization (PTQ) can reverse machine unlearning; we show that this is not a tuning artefact but a systematic dual failure: gradient-based methods that achieve meaningful forgetting lose it under compression, while methods that survive quantization barely change the model. Both failures trace to the same root cause: across all baselines, per-parameter updates are 47 to 828 times smaller than the NF4 quantization bin width. Updates diffused across billions of parameters cannot cross quantization bin boundaries, a consequence we formalize as a sparsity-permanence tradeoff. We present MANSU (Mechanistic-Aligned Null-Space Unlearning), which resolves both failure modes by combining three components: causal circuit attribution to isolate the minimal forget-set subgraph, circuit-restricted null-space projection with a diagonal-Fisher retain bound, and a per-parameter magnitude floor that guarantees quantization survival by construction. We additionally introduce Circuit Attribution Divergence (CAD), a mechanistic verification metric that distinguishes structural erasure from behavioral suppression, a distinction existing metrics cannot make. Across multiple model families and hazard benchmarks, MANSU is the first method to jointly satisfy all four properties with margin on each (meaningful forgetting, retain preservation, non-positive PTQ gap, and structural erasure), whereas gradient-based baselines recover up to +0.05 accuracy under compression.
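
The bin-width argument can be illustrated with a toy experiment. The sketch below is not the paper's implementation. It assumes a uniform 16-level grid as a stand-in for the actual NF4 codebook, plain round-to-nearest quantization, and a hypothetical `apply_magnitude_floor` helper whose floor is set to one full bin width. It compares how many 4-bit codes change under a diffuse update (every weight moved by 1/100 of a bin width) and under a sparse update whose entries are raised to the floor.

```python
import numpy as np

def quantize_codes(w, levels):
    """Round-to-nearest: index of the closest grid level for each weight."""
    return np.abs(w[:, None] - levels[None, :]).argmin(axis=1)

def fraction_flipped(w, delta, levels):
    """Fraction of weights whose 4-bit code changes after adding delta."""
    before = quantize_codes(w, levels)
    after = quantize_codes(w + delta, levels)
    return float(np.mean(before != after))

def apply_magnitude_floor(delta, floor):
    """Hypothetical helper: raise every nonzero update to at least `floor`
    in magnitude while keeping its sign."""
    out = delta.copy()
    small = (delta != 0) & (np.abs(delta) < floor)
    out[small] = np.sign(delta[small]) * floor
    return out

rng = np.random.default_rng(0)
levels = np.linspace(-1.0, 1.0, 16)   # assumption: uniform grid, not the real NF4 codebook
bin_width = levels[1] - levels[0]
w = rng.uniform(-1.0, 1.0, size=20_000)

# Diffuse update: every weight moves toward zero by 1/100 of a bin width.
diffuse = -np.sign(w) * bin_width / 100
flip_diffuse = fraction_flipped(w, diffuse, levels)

# Sparse update: same direction, restricted to 1% of the weights,
# with each selected entry floored at one full bin width.
mask = np.zeros(w.size, dtype=bool)
mask[rng.choice(w.size, size=w.size // 100, replace=False)] = True
sparse = apply_magnitude_floor(np.where(mask, diffuse, 0.0), bin_width)
flip_sparse = fraction_flipped(w[mask], sparse[mask], levels)

print(f"diffuse update: {flip_diffuse:.1%} of codes changed")
print(f"floored sparse update: {flip_sparse:.1%} of selected codes changed")
```

In this toy setting the diffuse update changes only a small fraction of codes (on the order of 1%, since each step is 1/100 of a bin width), so most of the update is erased by requantization. The floored update shifts essentially every selected weight into a neighboring bin, because a step of one full bin width toward zero always lands in a different bin on a uniform grid. The real NF4 grid is non-uniform, so a practical floor would presumably have to depend on the local bin width.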