MAAT: Multi-phase Adapter-Aware Targeted Unlearning

Abstract

Machine unlearning evaluation is structurally skewed: Why-type questions, which probe causal and relational knowledge, make up less than 0.06% of CounterFact, less than 0.6% of ZSRE, and less than 1.3% of each of TOFU, MUSE, and WMDP-Cyber. With so little representation, a method that fails on causal knowledge can still score highly in aggregate, and the failure goes undetected unless evaluation is balanced. We present 5WBENCH, a balanced 5,000-sample benchmark with 1,000 examples per 5W category (Who, What, When, Where, Why), which makes causal unlearning failures quantifiable for the first time. Using 5WBENCH, we show that no existing baseline simultaneously achieves high forgetting and high retention on Why-type questions: aggressive forgetting degrades retained knowledge, while conservative methods fail to forget causal facts. The difficulty of Why-type questions stems from two factors: multi-hop reasoning chains (44% of Why entries versus at most 2% for the other categories) and gradient dilution over answer spans that average 40.1 tokens. We propose MAAT (Multi-phase Adapter-Aware Targeted Unlearning), a three-phase framework that operates on LoRA adapter weights and combines gradient-projected ascent, SVD rank-dimension pruning, task vector negation, and hybrid KL/hidden-state retain repair. MAAT is the first method to simultaneously achieve high forgetting and high retention on Why-type causal knowledge, establishing a new operating point on the forget-retain Pareto frontier. We make our code publicly available.
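The masking effect of a skewed category mix can be made concrete with a short arithmetic sketch. The Python snippet below is illustrative only: it assumes a hypothetical method that forgets every non-Why item and no Why item, takes the 1.3% upper bound on the Why share quoted above, and splits the remaining share evenly across the other four categories (an assumption, not a property of any real benchmark). The function name `aggregate_forget_rate` and all rates are invented for illustration and are not results from this work.

```python
# Hypothetical illustration of aggregate-score masking.
# None of the forget rates below are measured results.

CATEGORIES = ["Who", "What", "When", "Where", "Why"]


def aggregate_forget_rate(category_weight, category_forget_rate):
    """Return the sample-weighted mean forget rate over all categories."""
    total = sum(category_weight.values())
    weighted = sum(
        category_weight[c] * category_forget_rate[c] for c in category_weight
    )
    return weighted / total


# Hypothetical method: perfect forgetting everywhere except Why, where it fails.
forget_rate = {c: 1.0 for c in CATEGORIES}
forget_rate["Why"] = 0.0

# Skewed mix: Why holds 1.3% of the samples (the upper bound quoted above).
# The even split of the remainder is an assumption made for simplicity.
why_share = 0.013
skewed = {c: (1.0 - why_share) / 4 for c in CATEGORIES}
skewed["Why"] = why_share

# Balanced mix in the style of 5WBENCH: 1,000 samples per category.
balanced = {c: 1000 for c in CATEGORIES}

skewed_score = aggregate_forget_rate(skewed, forget_rate)
balanced_score = aggregate_forget_rate(balanced, forget_rate)

print(f"skewed aggregate:   {skewed_score:.3f}")    # 0.987
print(f"balanced aggregate: {balanced_score:.3f}")  # 0.800
```

Under these assumptions the complete failure on Why costs only 1.3 points of aggregate score on the skewed mix but 20 points on the balanced one, which is why a per-category breakdown is needed to expose it.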