CRISP: Persistent Concept Unlearning via Sparse Autoencoders

Abstract

As large language models (LLMs) are increasingly deployed in real-world applications, the need to selectively remove unwanted knowledge while preserving model utility has become paramount. Recent work has explored sparse autoencoders (SAEs) to perform precise interventions on monosemantic features. However, most SAE-based methods operate at inference time, which does not create persistent changes in the model's parameters. Such interventions can be bypassed or reversed by malicious actors with parameter access. We introduce CRISP, a parameter-efficient method for persistent concept unlearning using SAEs. CRISP automatically identifies salient SAE features across multiple layers and suppresses their activations. We experiment with two LLMs and show that our method outperforms prior approaches on safety-critical unlearning tasks from the WMDP benchmark, successfully removing harmful knowledge while preserving general and in-domain capabilities. Feature-level analysis reveals that CRISP achieves semantically coherent separation between target and benign concepts, allowing precise suppression of the target features.
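The abstract's two steps, identifying salient SAE features and suppressing their activations, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the abstract does not give the selection rule or the training objective, so the ratio-based score, the function names (`select_salient_features`, `suppression_penalty`), and the `top_k` and `eps` parameters are hypothetical. Activations are plain lists of rows (tokens x features) for a single layer; a multi-layer version would repeat this per layer.

```python
from statistics import fmean


def mean_per_feature(acts):
    """Column-wise mean of a (tokens x features) list of activation rows."""
    return [fmean(col) for col in zip(*acts)]


def select_salient_features(target_acts, benign_acts, top_k=1, eps=1e-8):
    """Rank features by how much more they fire on target than on benign text.

    The ratio score is an illustrative assumption, not the paper's rule.
    """
    target_mean = mean_per_feature(target_acts)
    benign_mean = mean_per_feature(benign_acts)
    scores = [t / (b + eps) for t, b in zip(target_mean, benign_mean)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


def suppression_penalty(acts, feature_ids):
    """Mean activation of the selected features (hypothetical objective).

    Minimizing a term like this during fine-tuning would push the selected
    features toward zero on target-domain text.
    """
    return fmean(row[i] for row in acts for i in feature_ids)


# Toy activations: feature 1 fires strongly only on target-domain tokens.
target = [[0.0, 2.0, 0.1], [0.0, 4.0, 0.1]]
benign = [[1.0, 0.1, 0.1], [1.0, 0.1, 0.1]]

salient = select_salient_features(target, benign, top_k=1)
penalty = suppression_penalty(target, salient)
```

In this sketch the penalty is only computed, not optimized. Making the change persistent, as the abstract describes, would require updating model parameters so that the selected features stay inactive on target text, rather than zeroing them at inference time.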
