Durable Forgetting: Quantization-Permanent Unlearning via Circuit Attribution

Abstract

Standard unlearning evaluations measure behavioral suppression in full precision, immediately after training, even though every deployed language model is quantized first. Recent work has shown that 4-bit post-training quantization (PTQ) can reverse machine unlearning; we show that this is not a tuning artifact but a systematic dual failure: gradient-based methods that achieve meaningful forgetting lose it under compression, while methods that survive quantization barely change the model. Both failures trace to the same root cause: across all baselines, per-parameter update magnitudes are 47x to 828x smaller than the NF4 quantization bin width. Updates diffused across billions of parameters cannot cross quantization bin boundaries, a consequence we formalize as a sparsity-permanence tradeoff. We present MANSU (Mechanistic-Aligned Null-Space Unlearning), which resolves both failure modes by combining three components: causal circuit attribution to isolate the minimal forget-set subgraph, circuit-restricted null-space projection with a diagonal-Fisher retain bound, and a per-parameter magnitude floor that guarantees quantization survival by construction. We additionally introduce Circuit Attribution Divergence (CAD), a mechanistic verification metric that distinguishes structural erasure from behavioral suppression, a distinction existing metrics cannot make. Across multiple model families and hazard benchmarks, MANSU is the first method to jointly satisfy all four properties with margin on each (meaningful forgetting, retain preservation, non-positive PTQ gap, and structural erasure), whereas gradient-based baselines recover up to +0.05 accuracy under compression.
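The bin-width argument can be illustrated with a toy sketch. This is not the MANSU implementation: it assumes a uniform 16-level grid as a stand-in for the non-uniform NF4 grid, and the helper names `bin_index`, `survives`, and `apply_floor` are hypothetical. It shows only that an update far smaller than the bin width leaves the quantized weight unchanged, while an update raised to the bin width moves it to a different bin.

```python
import math

ABSMAX = 1.0                    # assumed per-block scaling range
BIN_WIDTH = 2 * ABSMAX / 15     # spacing of 16 uniform levels (stand-in for NF4)

def bin_index(w: float) -> int:
    """Index of the nearest of 16 evenly spaced levels in [-ABSMAX, ABSMAX]."""
    idx = round((w + ABSMAX) / BIN_WIDTH)
    return max(0, min(15, idx))

def survives(w: float, delta: float) -> bool:
    """True if the update moves the weight into a different quantization bin."""
    return bin_index(w + delta) != bin_index(w)

def apply_floor(delta: float, floor: float) -> float:
    """Raise |delta| to at least `floor`, keeping its sign; zero stays zero."""
    if delta == 0.0:
        return 0.0
    return math.copysign(max(abs(delta), floor), delta)

w = 0.30
diffuse_update = BIN_WIDTH / 47          # 47x below the bin width
floored_update = apply_floor(diffuse_update, BIN_WIDTH)

print(survives(w, diffuse_update))       # False: erased by quantization
print(survives(w, floored_update))       # True: crosses a bin boundary
```

On a uniform grid, any update whose magnitude is at least one bin width changes the bin index unless the weight is clipped at the edge of the range. The real NF4 grid has unequal bin widths, so an actual floor would need to depend on where each weight sits in the grid.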