CRISP: Persistent Concept Unlearning via Sparse Autoencoders

Abstract

As large language models (LLMs) are increasingly deployed in real-world applications, the need to selectively remove unwanted knowledge while preserving model utility has become paramount. Recent work has explored sparse autoencoders (SAEs) to perform precise interventions on monosemantic features. However, most SAE-based methods operate at inference time, which does not create persistent changes in the model's parameters. Such interventions can be bypassed or reversed by malicious actors with parameter access. We introduce CRISP, a parameter-efficient method for persistent concept unlearning using SAEs. CRISP automatically identifies salient SAE features across multiple layers and suppresses their activations. We experiment with two LLMs and show that our method outperforms prior approaches on safety-critical unlearning tasks from the WMDP benchmark, successfully removing harmful knowledge while preserving general and in-domain capabilities. Feature-level analysis reveals that CRISP achieves semantically coherent separation between target and benign concepts, allowing precise suppression of the target features.
