ZeroUnlearn: Few-Shot Knowledge Unlearning in Large Language Models

Abstract

Large language models trained on massive web corpora inevitably retain sensitive information, defined here as inputs that may induce harmful generations, which raises privacy and safety concerns. Existing machine unlearning methods rely primarily on retraining or aggressive fine-tuning, which are either computationally expensive or prone to degrading related knowledge and overall model utility. In this work, we reformulate machine unlearning as a precise knowledge re-mapping problem solved via model editing. We propose ZeroUnlearn, a few-shot unlearning framework that overwrites sensitive inputs by mapping them to a neutral target state and removing their original representations. ZeroUnlearn enforces representational orthogonality through a multiplicative parameter update with a closed-form solution, enabling efficient and targeted unlearning. We further extend ZeroUnlearn to a gradient-based variant for multi-sample unlearning. Experiments demonstrate that our approach outperforms existing baselines while preserving general model utility. Our code is available on GitHub: https://github.com/XMUDeepLIT/ZeroUnlearn.
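To make the idea of a closed-form multiplicative update concrete, the following is a minimal illustrative sketch, not the update rule used by ZeroUnlearn, which the abstract does not specify. It assumes a single linear layer with weight matrix `W`, a vector `k` standing in for the representation of a sensitive input, and a neutral target output `v_neutral`. The function name `multiplicative_unlearn` and all variable names are hypothetical. The sketch multiplies `W` by a projector orthogonal to `k`, which removes the original response to `k`, and then adds a rank-one term that writes the neutral target.

```python
import numpy as np

def multiplicative_unlearn(W, k, v_neutral):
    """Hypothetical sketch: remap the key k to v_neutral in closed form.

    W:         (d_out, d_in) weight matrix of one linear layer (assumed).
    k:         (d_in,) representation of the sensitive input (assumed).
    v_neutral: (d_out,) neutral target output (assumed).
    """
    k = np.asarray(k, dtype=float)
    kk = k @ k
    if kk == 0:
        raise ValueError("k must be a non-zero vector")
    # Orthogonal projector onto the complement of k, so P @ k == 0.
    P = np.eye(k.shape[0]) - np.outer(k, k) / kk
    # W @ P erases the layer's original response to k (multiplicative step);
    # the rank-one term then maps k to the neutral target.
    return W @ P + np.outer(v_neutral, k) / kk

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))
k = rng.normal(size=6)
v_neutral = np.zeros(4)

W_new = multiplicative_unlearn(W, k, v_neutral)

# Build an input orthogonal to k to check that unrelated directions are untouched.
x = rng.normal(size=6)
x_orth = x - (x @ k) / (k @ k) * k

print(np.allclose(W_new @ k, v_neutral))          # sensitive key hits the neutral target
print(np.allclose(W_new @ x_orth, W @ x_orth))    # orthogonal input is unchanged
```

In this toy setting the edit only changes the layer's behavior along the direction of `k`, which is one simple way to read "targeted" unlearning. Real models would need a choice of layer, a way to extract `k`, and handling of multiple keys, none of which this sketch addresses.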