SAEs Can Improve Unlearning: Dynamic Sparse Autoencoder Guardrails for Precision Unlearning in LLMs

Abstract

Machine unlearning is a promising approach to improving LLM safety by removing unwanted knowledge from the model. However, prevailing gradient-based unlearning methods suffer from high computational cost, hyperparameter instability, poor sequential unlearning capability, vulnerability to relearning attacks, low data efficiency, and a lack of interpretability. Sparse Autoencoders (SAEs) are well suited to improving these aspects because they enable targeted, activation-based unlearning, yet prior SAE-based approaches underperform gradient-based methods. This work demonstrates that, contrary to these earlier findings, SAEs can significantly improve unlearning when employed dynamically. We introduce Dynamic SAE Guardrails (DSG), a novel method for precision unlearning that leverages principled feature selection and a dynamic classifier. Our experiments show that DSG substantially outperforms leading unlearning methods, achieving superior forget-utility trade-offs. DSG addresses key drawbacks of gradient-based unlearning: it offers enhanced computational efficiency and stability, robust performance in sequential unlearning, stronger resistance to relearning attacks, better data efficiency (including zero-shot settings), and more interpretable unlearning.
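The abstract names two ingredients, feature selection and a dynamic classifier, without spelling out the mechanism. The following is a minimal sketch of how such an input-dependent guardrail might look, not the authors' implementation. It assumes the forget-relevant SAE feature indices have already been selected by some method, and that the classifier is a simple rule that intervenes only when enough tokens activate one of those features. The function names, the thresholds `act_threshold` and `token_fraction`, and the `clamp_value` default are all hypothetical.

```python
from typing import List, Sequence, Set


def should_intervene(
    feature_acts: Sequence[Sequence[float]],
    forget_features: Set[int],
    act_threshold: float = 0.0,
    token_fraction: float = 0.2,
) -> bool:
    """Hypothetical dynamic classifier: flag an input when the share of
    tokens that activate any forget feature reaches token_fraction."""
    if not feature_acts:
        return False
    flagged = sum(
        1
        for token in feature_acts
        if any(token[j] > act_threshold for j in forget_features)
    )
    return flagged / len(feature_acts) >= token_fraction


def apply_guardrail(
    feature_acts: Sequence[Sequence[float]],
    forget_features: Set[int],
    clamp_value: float = 0.0,
    act_threshold: float = 0.0,
    token_fraction: float = 0.2,
) -> List[List[float]]:
    """Clamp the forget features on flagged inputs; return a plain copy
    of the activations for inputs that are not flagged."""
    if not should_intervene(
        feature_acts, forget_features, act_threshold, token_fraction
    ):
        return [list(token) for token in feature_acts]
    return [
        [clamp_value if j in forget_features else a for j, a in enumerate(token)]
        for token in feature_acts
    ]
```

Here `feature_acts` is a tokens-by-features list of SAE activations for one input. Because unflagged inputs pass through unchanged, an intervention of this shape touches the model only on inputs that look forget-relevant, which is one plausible reading of how a dynamic scheme could preserve utility on everything else.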
