Forgetting That Lasts: Quantization-Permanent Unlearning via Circuit Attribution

Abstract

Standard unlearning evaluations measure behavioral suppression in full precision, immediately after training, even though every deployed language model is quantized first. Recent work has shown that 4-bit post-training quantization (PTQ) can reverse machine unlearning. We show that this is not a tuning artefact but a systematic dual failure: gradient-based methods that achieve meaningful forgetting lose it under compression, while methods that survive quantization barely change the model. Both failures trace to the same root cause. Across all baselines, per-parameter updates are 47 to 828 times smaller than the NF4 quantization bin width, and updates diffused across billions of parameters cannot cross quantization bin boundaries, a consequence we formalize as a sparsity-permanence tradeoff. We present MANSU (Mechanistic-Aligned Null-Space Unlearning), which resolves both failure modes by combining three components: causal circuit attribution to isolate the minimal forget-set subgraph, circuit-restricted null-space projection with a diagonal-Fisher retain bound, and a per-parameter magnitude floor that guarantees quantization survival by construction. We additionally introduce Circuit Attribution Divergence (CAD), a mechanistic verification metric that distinguishes structural erasure from behavioral suppression, a distinction existing metrics cannot make. Across multiple model families and hazard benchmarks, MANSU is the first method to jointly satisfy all four properties (meaningful forgetting, retain preservation, non-positive PTQ gap, and structural erasure) with margin on each, whereas gradient-based baselines recover up to +0.05 accuracy under compression.
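
The root-cause argument can be illustrated with a small numerical sketch. This is a toy model under stated assumptions, not the MANSU implementation: the NF4 levels are approximate rounded values, the block size of 64, the Gaussian weights, the 1/100 update ratio, and the names `block_scales` and `quantize_codes` are all hypothetical choices made for illustration. The per-block scale is held fixed at its pre-update value to keep the comparison simple. The floor used here (the largest gap between adjacent levels, applied toward zero) is one simple way to force a bin crossing and is not claimed to be the paper's floor.

```python
import numpy as np

# Approximate NF4 code values (rounded); illustrative, not authoritative.
NF4_LEVELS = np.array([
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
])
BLOCK = 64  # assumed block size for absmax scaling


def block_scales(w):
    """Per-block absolute maximum, shape (n_blocks, 1)."""
    return np.abs(w.reshape(-1, BLOCK)).max(axis=1, keepdims=True)


def quantize_codes(w, scales):
    """Index of the nearest NF4 level for each weight, given fixed scales."""
    x = w.reshape(-1, BLOCK) / scales
    codes = np.abs(x[..., None] - NF4_LEVELS).argmin(axis=-1)
    return codes.reshape(-1)


rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=BLOCK * 1024)  # toy weight vector
scales = block_scales(w)
base = quantize_codes(w, scales)

gaps = np.diff(NF4_LEVELS)                      # gaps between adjacent levels
per_weight_scale = np.repeat(scales.ravel(), BLOCK)

# Diffuse update: every weight moves by 1/100 of the smallest bin width.
eps = gaps.min() * per_weight_scale / 100.0
diffuse = w + rng.choice([-1.0, 1.0], size=w.size) * eps
diffuse_flip_rate = float(np.mean(quantize_codes(diffuse, scales) != base))

# Sparse update with a magnitude floor: a few weights move toward zero
# by the largest gap, which is wider than any single quantization cell.
targets = rng.choice(w.size, size=256, replace=False)
floor = gaps.max() * per_weight_scale[targets]
inward = np.where(w[targets] >= 0, -1.0, 1.0)
sparse = w.copy()
sparse[targets] += inward * floor
sparse_codes = quantize_codes(sparse, scales)
sparse_flip_rate = float(np.mean(sparse_codes[targets] != base[targets]))

print(f"diffuse update: {diffuse_flip_rate:.4f} of codes changed")
print(f"floored sparse update: {sparse_flip_rate:.4f} of targeted codes changed")
```

In this toy setting the diffuse update leaves almost every quantized code unchanged, so the quantized model is nearly identical to the original, while every weight that receives the floored update lands in a different bin. That is the intuition behind pairing a sparse, circuit-restricted update with a per-parameter magnitude floor.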