MAAT: Multi-phase Adapter-Aware Targeted Unlearning

Abstract

Machine unlearning evaluation is structurally skewed: Why-type questions, which probe causal and relational knowledge, make up less than 0.06% of CounterFact, 0.6% of ZSRE, and less than 1.3% of TOFU, MUSE, and WMDP-Cyber. With so little representation, a method that fails on causal knowledge can still score highly in aggregate, and the failure goes undetected without balanced evaluation. We present 5WBENCH, a balanced 5,000-sample benchmark with 1,000 examples per 5W category (Who, What, When, Where, Why), which makes causal unlearning failures quantifiable for the first time. Using 5WBENCH, we show that no existing baseline simultaneously achieves high forgetting and high retention on Why-type questions: aggressive forgetting degrades retained knowledge, while conservative methods fail to forget causal facts. Why-type difficulty stems from multi-hop reasoning chains (44% of Why entries vs. ≤2% for the other categories) and gradient dilution across 40.1-token answer spans. We propose MAAT (Multi-phase Adapter-Aware Targeted Unlearning), a three-phase framework that operates on LoRA adapter weights and combines gradient-projected ascent, SVD rank-dimension pruning, task vector negation, and hybrid KL-hidden-state retain repair. MAAT is the first method to simultaneously achieve high forgetting and high retention on Why-type causal knowledge, reaching a new operating point on the forget-retain Pareto frontier. Our code is publicly available.
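The aggregate-score argument in the abstract can be checked with a small calculation. The sketch below is illustrative only: the method and its per-category forget rates are hypothetical (perfect on four categories, total failure on Why), and the even split of the non-Why share across the other four categories is an assumption. Only the 0.6% Why share (the ZSRE figure) and the balanced 1,000-per-category layout come from the text.

```python
def aggregate_score(per_category_score, category_weight):
    """Weighted mean of per-category scores; weights may be shares or counts."""
    total = sum(category_weight.values())
    return sum(
        per_category_score[c] * category_weight[c] for c in category_weight
    ) / total


# Hypothetical method (assumption): forgets everything except Why-type facts.
forget_rate = {"who": 1.0, "what": 1.0, "when": 1.0, "where": 1.0, "why": 0.0}

# Skewed mix: Why is 0.6% of the data (ZSRE figure from the abstract).
# The even split of the remaining 99.4% is an assumption for illustration.
skewed_share = {"who": 0.2485, "what": 0.2485, "when": 0.2485,
                "where": 0.2485, "why": 0.006}

# Balanced mix: 1,000 examples per 5W category, as in 5WBENCH.
balanced_count = {c: 1000 for c in forget_rate}

skewed_score = aggregate_score(forget_rate, skewed_share)
balanced_score = aggregate_score(forget_rate, balanced_count)

print(f"skewed aggregate:   {skewed_score:.3f}")    # 0.994
print(f"balanced aggregate: {balanced_score:.3f}")  # 0.800
print(f"Why-only score:     {forget_rate['why']:.3f}")
```

Under these assumptions the skewed aggregate stays near perfect while the balanced aggregate drops by a fifth, and only the per-category breakdown shows that the Why score is zero.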