MAAT: Multi-phase Adapter-Aware Targeted Unlearning

Abstract

Machine unlearning evaluation is structurally skewed. Why-type questions, which probe causal and relational knowledge, make up less than 0.06% of CounterFact, less than 0.6% of ZSRE, and less than 1.3% of TOFU, MUSE, and WMDP-Cyber. With so little representation, a method that fails on causal knowledge can still score highly in aggregate, and the failure goes undetected unless the evaluation is balanced. We present 5WBENCH, a balanced 5,000-sample benchmark with 1,000 examples per 5W category (Who, What, When, Where, Why), which makes causal unlearning failures quantifiable for the first time. Using 5WBENCH, we show that no existing baseline simultaneously achieves high forgetting and high retention on Why-type questions: aggressive forgetting degrades retained knowledge, while conservative methods fail to forget causal facts. Two factors make Why-type questions difficult: multi-hop reasoning chains (present in 44% of Why entries versus at most 2% of entries in the other categories) and gradient dilution over answer spans of 40.1 tokens. We propose MAAT (Multi-phase Adapter-Aware Targeted Unlearning), a three-phase framework that operates on LoRA adapter weights and combines gradient-projected ascent, SVD rank-dimension pruning, task vector negation, and hybrid KL-hidden-state retain repair. MAAT is the first method to simultaneously achieve high forgetting and high retention on Why-type causal knowledge, reaching a new operating point on the forget-retain Pareto frontier. We make our code publicly available.
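As a rough illustration of two of the components named above, the following minimal sketch shows one common reading of gradient-projected ascent (remove from the forget-set gradient its component along the retain-set gradient, then take an ascent step) and of task vector negation (subtract the fine-tuning delta from the base weights). The abstract does not specify MAAT's actual formulation, so the function names, the flat-list parameter representation, and the `lr` and `alpha` values are all assumptions made for illustration only.

```python
# Illustrative sketch only: not the MAAT implementation.
# Parameters and gradients are flat lists of floats (an assumption);
# real LoRA adapters would be tensors for the A and B matrices.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def project_out(forget_grad, retain_grad, eps=1e-12):
    """Remove from forget_grad its component along retain_grad."""
    denom = dot(retain_grad, retain_grad)
    if denom < eps:
        # Retain gradient is (near) zero: nothing to project out.
        return list(forget_grad)
    scale = dot(forget_grad, retain_grad) / denom
    return [f - scale * r for f, r in zip(forget_grad, retain_grad)]

def ascent_step(params, forget_grad, retain_grad, lr=0.1):
    """Gradient ascent on the forget loss along the projected direction."""
    g = project_out(forget_grad, retain_grad)
    return [p + lr * gi for p, gi in zip(params, g)]

def negate_task_vector(base, finetuned, alpha=1.0):
    """Task vector negation: base - alpha * (finetuned - base)."""
    return [b - alpha * (f - b) for b, f in zip(base, finetuned)]

if __name__ == "__main__":
    projected = project_out([1.0, 1.0], [1.0, 0.0])
    print(projected)
    print(ascent_step([0.0, 0.0], [1.0, 1.0], [1.0, 0.0], lr=0.5))
    print(negate_task_vector([1.0, 2.0], [1.5, 1.0]))
```

By construction the projected update is orthogonal to the retain gradient, so to first order it leaves the retain loss unchanged. Whether MAAT uses this projection or a different one is not stated in the abstract.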