Precise In-Parameter Concept Erasure in Large Language Models

Abstract

Large language models (LLMs) often acquire knowledge during pretraining that is undesirable in downstream deployments, e.g., sensitive information or copyrighted content. Existing approaches for removing such knowledge rely on fine-tuning, training low-rank adapters, or fact-level editing, but these are either too coarse, too shallow, or ineffective. In this work, we propose PISCES (Precise In-parameter Suppression for Concept EraSure), a novel framework for precisely erasing entire concepts from model parameters by directly editing the directions that encode them in parameter space. PISCES uses a disentangler model to decompose MLP vectors into interpretable features, identifies the features associated with a target concept using automated interpretability techniques, and removes them from the model parameters. Experiments on Gemma 2 and Llama 3.1 over various concepts show that PISCES achieves modest gains in efficacy over leading erasure methods, reducing accuracy on the target concept to as low as 7.7%, while dramatically improving erasure specificity (by up to 31%) and robustness (by up to 38%). Overall, these results demonstrate that feature-based in-parameter editing enables a more precise and reliable approach for removing conceptual knowledge in language models.
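As a rough illustration of what "removing a direction from a parameter vector" can mean, the toy sketch below projects a single direction out of a few small vectors. It is not the PISCES procedure: the paper relies on a disentangler model and automated interpretability to find concept features, whereas here the direction is simply given. The names `remove_direction`, `mlp_vectors`, and `concept_direction` are hypothetical and chosen only for this example.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def remove_direction(vector, direction, strength=1.0):
    """Subtract `strength` times the component of `vector` along `direction`.

    With strength=1.0 the result is orthogonal to `direction`.
    """
    unit = normalize(direction)
    coeff = dot(vector, unit)
    return [x - strength * coeff * u for x, u in zip(vector, unit)]


# Toy stand-ins for MLP vectors (assumption: real ones have thousands of dims).
mlp_vectors = [
    [2.0, 1.0, 0.0],
    [0.0, 3.0, 4.0],
]

# Hypothetical concept direction; in practice it would have to be identified,
# for example from features produced by a disentangler model.
concept_direction = [1.0, 0.0, 0.0]

edited = [remove_direction(v, concept_direction) for v in mlp_vectors]
```

In this sketch the first vector loses its component along the concept direction, while the second, which has no such component, is left unchanged. That mirrors, in a very simplified way, the goal of erasing a concept while leaving unrelated parameters alone.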

