Lasting Forgetting: Quantization-Permanent Unlearning via Circuit Attribution

Abstract

Standard unlearning evaluations measure behavioral suppression in full precision, immediately after training, even though every deployed language model is quantized first. Recent work has shown that 4-bit post-training quantization (PTQ) can reverse machine unlearning. We show that this is not a tuning artifact but a systematic dual failure: gradient-based methods that achieve meaningful forgetting lose it under compression, while methods that survive quantization barely change the model. Both failures trace to the same root cause. Across all baselines, per-parameter updates lie 47 to 828 times below the NF4 quantization bin width, and updates diffused across billions of parameters cannot cross quantization bin boundaries, a consequence we formalize as a sparsity-permanence tradeoff. We present MANSU (Mechanistic-Aligned Null-Space Unlearning), which addresses both failure modes by combining three components: causal circuit attribution, which isolates the minimal forget-set subgraph; circuit-restricted null-space projection with a diagonal-Fisher retain bound; and a per-parameter magnitude floor that guarantees quantization survival by construction. We additionally introduce Circuit Attribution Divergence (CAD), a mechanistic verification metric that distinguishes structural erasure from behavioral suppression, a distinction existing metrics cannot make. Across multiple model families and hazard benchmarks, MANSU is the first method to jointly satisfy all four properties with margin on each (meaningful forgetting, retain preservation, non-positive PTQ gap, and structural erasure), whereas gradient-based baselines recover up to +0.05 accuracy under compression.
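
The bin-width argument can be illustrated with a toy sketch. It is not the paper's implementation or its formal tradeoff. It assumes a single absmax-scaled block of 64 weights, rounded approximations of the NF4 code values, and hypothetical helper names (`quantize_block`, `changed_codes`). The same total update budget is applied once densely and once on a single weight, and the 4-bit codes are compared before and after.

```python
# Toy illustration only: approximate (rounded) NF4 code values and one
# absmax-scaled block. All names and numbers here are assumptions.
NF4_LEVELS = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]


def quantize_block(weights):
    """Return the NF4 code index of each weight after absmax scaling."""
    scale = max(abs(w) for w in weights)
    codes = []
    for w in weights:
        x = w / scale
        # Nearest-level lookup over the 16 code values.
        codes.append(min(range(16), key=lambda i: abs(NF4_LEVELS[i] - x)))
    return codes


def changed_codes(before, after):
    """Count positions whose 4-bit code differs."""
    return sum(a != b for a, b in zip(before, after))


# Hypothetical block of 64 weights sitting exactly on the quantization levels.
scale = 0.05
base = [NF4_LEVELS[i % 16] * scale for i in range(64)]
base_codes = quantize_block(base)

# Dense update: a tiny step on every weight, far below the narrowest gap
# between neighbouring levels.
step = 1e-4
dense = [w + step for w in base]

# Sparse update: the same total L1 budget placed on one weight
# (index 7, which sits on the 0.0 level).
sparse = list(base)
sparse[7] += step * len(base)

dense_flips = changed_codes(base_codes, quantize_block(dense))
sparse_flips = changed_codes(base_codes, quantize_block(sparse))
print(dense_flips, sparse_flips)
```

In this toy setting the dense update leaves every code unchanged, so the quantized block is identical to the original, while the concentrated update moves one weight into a different bin and therefore survives quantization.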