
Precise In-Parameter Concept Erasure in Large Language Models

May 28, 2025
Authors: Yoav Gur-Arieh, Clara Suslik, Yihuai Hong, Fazl Barez, Mor Geva
cs.AI

Abstract

Large language models (LLMs) often acquire knowledge during pretraining that is undesirable in downstream deployments, e.g., sensitive information or copyrighted content. Existing approaches for removing such knowledge rely on fine-tuning, training low-rank adapters or fact-level editing, but these are either too coarse, too shallow, or ineffective. In this work, we propose PISCES (Precise In-parameter Suppression for Concept EraSure), a novel framework for precisely erasing entire concepts from model parameters by directly editing directions that encode them in parameter space. PISCES uses a disentangler model to decompose MLP vectors into interpretable features, identifies those associated with a target concept using automated interpretability techniques, and removes them from model parameters. Experiments on Gemma 2 and Llama 3.1 over various concepts show that PISCES achieves modest gains in efficacy over leading erasure methods, reducing accuracy on the target concept to as low as 7.7%, while dramatically improving erasure specificity (by up to 31%) and robustness (by up to 38%). Overall, these results demonstrate that feature-based in-parameter editing enables a more precise and reliable approach for removing conceptual knowledge in language models.
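The core editing step described above — decomposing MLP weight vectors into interpretable feature directions and removing those tied to a target concept — can be sketched as a projection in parameter space. The snippet below is a minimal illustration, not the paper's implementation: it assumes the disentangler supplies a matrix of decoder directions (one per feature, as in a sparse-autoencoder-style decomposition) and that the concept features have already been identified; the function names and shapes are hypothetical.

```python
import numpy as np

def erase_concept_features(mlp_vectors, decoder_directions, concept_feature_ids):
    """Remove the span of selected concept-feature directions from MLP weight vectors.

    mlp_vectors:        (n_vectors, d_model) rows of an MLP weight matrix
    decoder_directions: (n_features, d_model) disentangler decoder directions (hypothetical)
    concept_feature_ids: indices of features identified as encoding the target concept
    """
    # Orthonormal basis for the subspace spanned by the concept directions
    Q, _ = np.linalg.qr(decoder_directions[concept_feature_ids].T)
    # Project every MLP vector onto the orthogonal complement of that subspace,
    # so no component along any concept direction remains in the parameters
    return mlp_vectors - (mlp_vectors @ Q) @ Q.T
```

After the edit, every row of the returned matrix has zero component along each selected concept direction, while components outside that subspace are untouched — which is what makes this kind of feature-level edit more specific than coarse fine-tuning.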

