

CRISP: Persistent Concept Unlearning via Sparse Autoencoders

August 19, 2025
Authors: Tomer Ashuach, Dana Arad, Aaron Mueller, Martin Tutek, Yonatan Belinkov
cs.AI

Abstract

As large language models (LLMs) are increasingly deployed in real-world applications, the need to selectively remove unwanted knowledge while preserving model utility has become paramount. Recent work has explored sparse autoencoders (SAEs) to perform precise interventions on monosemantic features. However, most SAE-based methods operate at inference time, which does not create persistent changes in the model's parameters. Such interventions can be bypassed or reversed by malicious actors with parameter access. We introduce CRISP, a parameter-efficient method for persistent concept unlearning using SAEs. CRISP automatically identifies salient SAE features across multiple layers and suppresses their activations. We experiment with two LLMs and show that our method outperforms prior approaches on safety-critical unlearning tasks from the WMDP benchmark, successfully removing harmful knowledge while preserving general and in-domain capabilities. Feature-level analysis reveals that CRISP achieves semantically coherent separation between target and benign concepts, allowing precise suppression of the target features.
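
The abstract describes a two-step recipe: score SAE features by how strongly they activate on the target (forget) concept versus benign text, then fine-tune the model's parameters so those features stay suppressed. Below is a minimal PyTorch sketch of that idea under stated assumptions; the toy SAE, the saliency ratio, the top-k selection, and the loss weighting are all hypothetical stand-ins for illustration, not the paper's actual implementation.

```python
# Minimal sketch of a CRISP-style pipeline (illustrative assumptions throughout):
# (1) score SAE features by how salient they are on forget vs. retain data,
# (2) fine-tune model weights so the selected features stay suppressed.
import torch
import torch.nn as nn

torch.manual_seed(0)
D_MODEL, D_SAE = 64, 256  # toy dimensions

class ToySAE(nn.Module):
    """Stand-in for a pretrained sparse autoencoder on one layer's residual stream."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(D_MODEL, D_SAE)
        self.dec = nn.Linear(D_SAE, D_MODEL)

    def encode(self, h):
        return torch.relu(self.enc(h))  # non-negative sparse feature activations

sae = ToySAE()
for p in sae.parameters():
    p.requires_grad_(False)  # the SAE stays frozen; only the model is updated

model_layer = nn.Linear(D_MODEL, D_MODEL)  # stand-in for a trainable LLM block

# Toy "hidden states" produced on forget-set vs. retain-set tokens.
h_forget = torch.randn(512, D_MODEL) + 0.5
h_retain = torch.randn(512, D_MODEL)

# Step 1: select features that fire on forget data but stay quiet on retain data.
with torch.no_grad():
    act_f = sae.encode(h_forget).mean(dim=0)
    act_r = sae.encode(h_retain).mean(dim=0)
saliency = act_f / (act_r + 1e-6)        # assumed saliency score
target_feats = saliency.topk(20).indices  # assumed top-k selection

# Step 2: fine-tune so target features are suppressed on forget inputs while
# retain-set outputs stay (roughly) unchanged.
opt = torch.optim.Adam(model_layer.parameters(), lr=1e-3)
with torch.no_grad():
    retain_ref = model_layer(h_retain)  # frozen reference outputs

for step in range(200):
    opt.zero_grad()
    feats = sae.encode(model_layer(h_forget))
    suppress = feats[:, target_feats].mean()                    # drive to zero
    preserve = (model_layer(h_retain) - retain_ref).pow(2).mean()
    loss = suppress + 10.0 * preserve  # assumed weighting
    loss.backward()
    opt.step()

print(f"suppress={suppress.item():.4f}  preserve={preserve.item():.6f}")
```

Because the suppression is baked into the layer's weights rather than applied as an inference-time hook, it persists even for users with parameter access, which is the failure mode of inference-time SAE interventions that the abstract highlights.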