CRISP: Persistent Concept Unlearning via Sparse Autoencoders

August 19, 2025
Authors: Tomer Ashuach, Dana Arad, Aaron Mueller, Martin Tutek, Yonatan Belinkov
cs.AI

Abstract

As large language models (LLMs) are increasingly deployed in real-world applications, the need to selectively remove unwanted knowledge while preserving model utility has become paramount. Recent work has explored sparse autoencoders (SAEs) to perform precise interventions on monosemantic features. However, most SAE-based methods operate at inference time, which does not create persistent changes in the model's parameters. Such interventions can be bypassed or reversed by malicious actors with parameter access. We introduce CRISP, a parameter-efficient method for persistent concept unlearning using SAEs. CRISP automatically identifies salient SAE features across multiple layers and suppresses their activations. We experiment with two LLMs and show that our method outperforms prior approaches on safety-critical unlearning tasks from the WMDP benchmark, successfully removing harmful knowledge while preserving general and in-domain capabilities. Feature-level analysis reveals that CRISP achieves semantically coherent separation between target and benign concepts, allowing precise suppression of the target features.
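The core mechanism the abstract describes — identifying salient SAE features and suppressing their activations — can be illustrated with a minimal sketch. The function names, the mean-difference saliency score, and the top-k cutoff below are illustrative assumptions, not the paper's exact formulation; CRISP additionally makes the suppression persistent via parameter-efficient fine-tuning, which this sketch omits.

```python
# Minimal sketch (not the authors' code): score SAE features by how much
# more they activate on "forget" text than on "retain" text, then zero out
# the top-scoring features. Saliency metric and top_k are assumptions.
import torch

def salient_features(forget_acts: torch.Tensor,
                     retain_acts: torch.Tensor,
                     top_k: int = 50) -> torch.Tensor:
    """Select SAE features that fire on forget text but not on retain text.

    forget_acts, retain_acts: (num_tokens, num_features) SAE feature
    activations collected at one layer over each corpus.
    """
    saliency = forget_acts.mean(dim=0) - retain_acts.mean(dim=0)
    return torch.topk(saliency, k=top_k).indices

def suppress(sae_acts: torch.Tensor, feature_ids: torch.Tensor) -> torch.Tensor:
    """Zero the selected features before the SAE decoder reconstructs the
    residual stream. A persistent method would fine-tune the model toward
    these suppressed activations rather than clamping at inference time.
    """
    out = sae_acts.clone()
    out[:, feature_ids] = 0.0
    return out

# Toy usage with random tensors standing in for real SAE activations.
forget = torch.rand(1000, 16384)
retain = torch.rand(1000, 16384)
ids = salient_features(forget, retain, top_k=50)
clean = suppress(forget, ids)
```

Applying the selection per layer, as the abstract notes, yields one feature set per hooked layer; the suppressed activations then serve as training targets so the change is baked into the weights rather than applied at inference time.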