Large Language Model Unlearning via Embedding-Corrupted Prompts

Abstract

Large language models (LLMs) have advanced to encompass extensive knowledge across diverse domains. Controlling what a large language model should not know is nonetheless important for ensuring alignment and thus safe use. Accurately and efficiently unlearning knowledge from an LLM remains challenging, however, for two reasons: the fuzzy boundary between retention and forgetting can cause collateral damage, and optimizing state-of-the-art models with hundreds of billions of parameters demands large computational resources. In this work, we present Embedding-COrrupted (ECO) Prompts, a lightweight unlearning framework for large language models that addresses both knowledge entanglement and unlearning efficiency. Instead of relying on the LLM itself to unlearn, we enforce an unlearned state during inference by employing a prompt classifier to identify and safeguard prompts to forget. Offline, we learn corruptions added to prompt embeddings via zeroth-order optimization toward the unlearning objective; during inference, we corrupt the prompts flagged by the classifier. We find that these embedding-corrupted prompts not only lead to desirable outputs that satisfy the unlearning objective but also closely approximate the output of a model that has never been trained on the data intended for forgetting. Through extensive experiments on unlearning, we demonstrate that our method achieves promising unlearning with nearly zero side effects, both in general domains and in domains closely related to the unlearned ones. Additionally, we highlight the scalability of our method to 100 LLMs, ranging from 0.5B to 236B parameters, incurring no additional cost as the number of parameters increases.
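The inference-time flow described in the abstract (flag a prompt, then corrupt its embeddings with a strength tuned offline by zeroth-order optimization) can be illustrated with a toy sketch. The abstract does not specify the classifier, the form of the corruption, or the optimizer, so everything below is an assumption made for illustration: the keyword-based `flag_prompt`, the additive Gaussian noise with a single scalar strength `sigma`, the two-point finite-difference update, and all function names are hypothetical and not taken from the paper.

```python
import random


def flag_prompt(prompt, forget_terms):
    """Toy stand-in for the prompt classifier (assumption: keyword match).

    Returns True when the prompt mentions any term from the forget set.
    """
    text = prompt.lower()
    return any(term in text for term in forget_terms)


def corrupt(embeddings, sigma, rng):
    """Add zero-mean Gaussian noise of strength sigma to each embedding value.

    Additive Gaussian noise is an assumed corruption form, chosen for brevity.
    """
    return [[x + sigma * rng.gauss(0.0, 1.0) for x in row] for row in embeddings]


def guarded_embeddings(prompt, embeddings, sigma, forget_terms, rng):
    """Corrupt embeddings only for flagged prompts; pass others through unchanged."""
    if flag_prompt(prompt, forget_terms):
        return corrupt(embeddings, sigma, rng)
    return embeddings


def zeroth_order_step(loss, sigma, lr=0.1, mu=1e-2):
    """One zeroth-order update of sigma.

    Uses a two-point finite-difference estimate of the derivative, so it needs
    only loss evaluations and never touches model gradients or parameters.
    """
    grad_est = (loss(sigma + mu) - loss(sigma - mu)) / (2 * mu)
    return sigma - lr * grad_est


if __name__ == "__main__":
    rng = random.Random(0)
    forget_terms = ["secret project"]            # hypothetical forget set
    embeddings = [[0.1, 0.2], [0.3, 0.4]]        # hypothetical token embeddings

    # Offline: tune sigma against a placeholder objective (not the paper's loss).
    def toy_loss(s):
        return (s - 2.0) ** 2

    sigma = 0.0
    for _ in range(50):
        sigma = zeroth_order_step(toy_loss, sigma)

    # Inference: only the flagged prompt has its embeddings corrupted.
    print(guarded_embeddings("Tell me about the weather", embeddings, sigma, forget_terms, rng))
    print(guarded_embeddings("Describe the secret project", embeddings, sigma, forget_terms, rng))
```

Because the update relies only on evaluating the objective, the cost of tuning `sigma` in this sketch does not depend on differentiating through the model, which is in the spirit of the abstract's claim that no additional cost is incurred as the parameter count grows.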

