MAAT: Multi-phase Adapter-Aware Targeted Unlearning

Abstract

Machine unlearning evaluation is structurally skewed: Why-type questions, which probe causal and relational knowledge, make up less than 0.06% of CounterFact, 0.6% of ZSRE, and less than 1.3% of TOFU, MUSE, and WMDP-Cyber. With representation this close to zero, a method that fails on causal knowledge can still score highly in aggregate, and the failure goes undetected without balanced evaluation. We present 5WBENCH, a balanced 5,000-sample benchmark with 1,000 examples per 5W category (Who, What, When, Where, Why), which makes causal unlearning failures quantifiable for the first time. Using 5WBENCH, we show that no existing baseline simultaneously achieves high forgetting and high retention on Why-type questions: aggressive forgetting degrades retained knowledge, while conservative methods fail to forget causal facts. The difficulty of Why-type questions stems from multi-hop reasoning chains (present in 44% of Why entries versus at most 2% for the other categories) and from gradient dilution across 40.1-token answer spans. We propose MAAT (Multi-phase Adapter-Aware Targeted Unlearning), a three-phase framework that operates on LoRA adapter weights and combines gradient-projected ascent, SVD rank-dimension pruning, task vector negation, and hybrid KL-hidden-state retain repair. MAAT is the first method to simultaneously achieve high forgetting and high retention on Why-type causal knowledge, reaching a new operating point on the forget-retain Pareto frontier. We make our code publicly available.
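The masking effect described above can be illustrated with a small arithmetic sketch. The per-category forget rates below are hypothetical placeholders chosen for illustration, not results from this work. Only the 0.6% Why share (the ZSRE figure) and the 1,000-per-category balanced split come from the abstract. The function name aggregate_score and the assumption that the non-Why share is split evenly across the other four categories are also illustrative.

```python
# Sketch: how a skewed category mix can hide a Why-type failure.
# All forget rates here are HYPOTHETICAL placeholders, not reported results.

CATEGORIES = ["who", "what", "when", "where", "why"]


def aggregate_score(per_category, weights):
    """Weighted mean of per-category scores; weights need not sum to 1."""
    total = sum(weights[c] for c in per_category)
    return sum(per_category[c] * weights[c] for c in per_category) / total


# Hypothetical method: forgets well everywhere except on Why-type questions.
forget_rate = {"who": 0.95, "what": 0.95, "when": 0.95, "where": 0.95, "why": 0.10}

# Skewed mix: Why is 0.6% of samples (the ZSRE share quoted in the abstract);
# the remaining share is split evenly across the other four categories
# (an assumption made only for this illustration).
skewed = {c: (0.006 if c == "why" else 0.994 / 4) for c in CATEGORIES}

# Balanced mix: 1,000 samples per category, as in 5WBENCH.
balanced = {c: 1000 for c in CATEGORIES}

skewed_score = aggregate_score(forget_rate, skewed)
balanced_score = aggregate_score(forget_rate, balanced)

print(f"aggregate forget rate, skewed mix:   {skewed_score:.4f}")
print(f"aggregate forget rate, balanced mix: {balanced_score:.4f}")
print(f"Why-only forget rate:                {forget_rate['why']:.4f}")
```

Under these placeholder numbers the skewed aggregate stays near 0.94 while the balanced aggregate falls to 0.78, so only the balanced mix (or a per-category report) exposes the Why-type failure.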