Precise In-Parameter Concept Erasure in Large Language Models

Abstract

Large language models (LLMs) often acquire knowledge during pretraining that is undesirable in downstream deployments, e.g., sensitive information or copyrighted content. Existing approaches for removing such knowledge rely on fine-tuning, training low-rank adapters, or fact-level editing, but these are either too coarse, too shallow, or ineffective. In this work, we propose PISCES (Precise In-parameter Suppression for Concept EraSure), a novel framework for precisely erasing entire concepts from model parameters by directly editing the directions that encode them in parameter space. PISCES uses a disentangler model to decompose MLP vectors into interpretable features, identifies the features associated with a target concept using automated interpretability techniques, and removes them from the model parameters. Experiments on Gemma 2 and Llama 3.1 over various concepts show that PISCES achieves modest gains in efficacy over leading erasure methods, reducing accuracy on the target concept to as low as 7.7%, while dramatically improving erasure specificity (by up to 31%) and robustness (by up to 38%). Overall, these results demonstrate that feature-based in-parameter editing enables a more precise and reliable approach for removing conceptual knowledge in language models.
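To make the idea of "removing a direction from a parameter vector" concrete, the following is a minimal linear-algebra sketch, not the PISCES procedure itself. It assumes the concept-related feature directions have already been identified and are given as plain vectors in the same space as an MLP vector. The disentangler and the automated interpretability step are not modeled here. All names (`erase_concept`, `concept_directions`, `mlp_vector`) are hypothetical and chosen only for illustration.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthonormalize(directions: List[Vector], eps: float = 1e-12) -> List[Vector]:
    """Gram-Schmidt: build an orthonormal basis for the span of `directions`.

    Directions that are (numerically) linearly dependent on earlier ones are skipped.
    """
    basis: List[Vector] = []
    for d in directions:
        r = list(d)
        for b in basis:
            c = dot(r, b)
            r = [ri - c * bi for ri, bi in zip(r, b)]
        norm = dot(r, r) ** 0.5
        if norm > eps:
            basis.append([ri / norm for ri in r])
    return basis


def erase_concept(mlp_vector: Vector, concept_directions: List[Vector]) -> Vector:
    """Project `mlp_vector` onto the orthogonal complement of the concept subspace.

    Illustrative assumption: a concept is represented by a few direction vectors,
    and "erasing" it means removing the vector's component inside their span.
    """
    edited = list(mlp_vector)
    for b in orthonormalize(concept_directions):
        c = dot(edited, b)
        edited = [e - c * bi for e, bi in zip(edited, b)]
    return edited


# Toy example in 4 dimensions: the concept subspace is spanned by two directions
# that together cover the first two coordinates.
concept_directions = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
mlp_vector = [0.5, -2.0, 3.0, 1.0]
edited = erase_concept(mlp_vector, concept_directions)
print(edited)  # the first two coordinates are removed, the last two are untouched
```

After the edit, the vector has no component along any concept direction, while everything orthogonal to the concept subspace is preserved. This mirrors, in a simplified form, the specificity goal described in the abstract.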
