Precise In-Parameter Concept Erasure in Large Language Models
May 28, 2025
Authors: Yoav Gur-Arieh, Clara Suslik, Yihuai Hong, Fazl Barez, Mor Geva
cs.AI
Abstract
Large language models (LLMs) often acquire knowledge during pretraining that
is undesirable in downstream deployments, e.g., sensitive information or
copyrighted content. Existing approaches for removing such knowledge rely on
fine-tuning, training low-rank adapters, or fact-level editing, but these are
either too coarse, too shallow, or ineffective. In this work, we propose PISCES
(Precise In-parameter Suppression for Concept EraSure), a novel framework for
precisely erasing entire concepts from model parameters by directly editing
directions that encode them in parameter space. PISCES uses a disentangler
model to decompose MLP vectors into interpretable features, identifies those
associated with a target concept using automated interpretability techniques,
and removes them from model parameters. Experiments on Gemma 2 and Llama 3.1
over various concepts show that PISCES achieves modest gains in efficacy over
leading erasure methods, reducing accuracy on the target concept to as low as
7.7%, while dramatically improving erasure specificity (by up to 31%) and
robustness (by up to 38%). Overall, these results demonstrate that
feature-based in-parameter editing enables a more precise and reliable approach
for removing conceptual knowledge in language models.
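The core editing step described above, removing concept-aligned feature directions from MLP weight vectors, can be illustrated with a minimal toy sketch. Here random matrices stand in for the trained disentangler and the MLP weights, and `erase_features` and all dimensions are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_mlp, d_feat = 16, 64, 128  # toy sizes (assumed, not from the paper)

# Toy MLP output matrix: each column is one MLP value vector.
W_out = rng.normal(size=(d_model, d_mlp))

# Stand-in "disentangler": a dictionary of unit-norm feature directions.
# In practice this would come from a trained decomposition; here it is random.
D = rng.normal(size=(d_model, d_feat))
D /= np.linalg.norm(D, axis=0, keepdims=True)

def erase_features(W, D, feature_ids):
    """Project the selected feature directions out of every MLP vector."""
    V = D[:, feature_ids]        # (d_model, k) directions to remove
    P = V @ np.linalg.pinv(V)    # orthogonal projector onto their span
    return W - P @ W             # strip those components from each column

# Suppose automated interpretability flagged these features as concept-related.
concept_features = [3, 17, 42]
W_clean = erase_features(W_out, D, concept_features)

# The edited vectors now have (near-)zero component along the erased directions.
residual = np.abs(D[:, concept_features].T @ W_clean).max()
print(residual)
```

After the edit, every MLP vector is orthogonal to the flagged feature directions, so the model can no longer write along them, while components outside that small subspace are untouched. This locality is what the abstract's specificity gains rely on.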