

Large Language Model Unlearning via Embedding-Corrupted Prompts

June 12, 2024
Authors: Chris Yuhao Liu, Yaxuan Wang, Jeffrey Flanigan, Yang Liu
cs.AI

Abstract

Large language models (LLMs) have advanced to encompass extensive knowledge across diverse domains. Yet controlling what a large language model should not know is important for ensuring alignment and thus safe use. However, accurately and efficiently unlearning knowledge from an LLM remains challenging, both because of the potential collateral damage caused by the fuzzy boundary between retention and forgetting, and because of the large computational requirements for optimization across state-of-the-art models with hundreds of billions of parameters. In this work, we present Embedding-COrrupted (ECO) Prompts, a lightweight unlearning framework for large language models that addresses both the challenge of knowledge entanglement and that of unlearning efficiency. Instead of relying on the LLM itself to unlearn, we enforce an unlearned state during inference by employing a prompt classifier to identify and safeguard prompts to forget. We learn corruptions added to prompt embeddings via zeroth-order optimization toward the unlearning objective offline, and corrupt prompts flagged by the classifier during inference. We find that these embedding-corrupted prompts not only lead to desirable outputs that satisfy the unlearning objective but also closely approximate the output of a model that has never been trained on the data intended for forgetting. Through extensive experiments on unlearning, we demonstrate the superiority of our method in achieving promising unlearning with nearly zero side effects, both in general domains and in domains closely related to the unlearned ones. Additionally, we highlight the scalability of our method to 100 LLMs ranging from 0.5B to 236B parameters, incurring no additional cost as the number of parameters increases.
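Below is a minimal, self-contained sketch of the two ideas the abstract describes: routing flagged prompts through an embedding corruption at inference time, and estimating gradients for the corruption parameters with a two-point zeroth-order scheme. All names here (should_forget, corrupt, CORRUPTION_SCALE, the keyword-based classifier, the toy embedding) are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch of the ECO-Prompts inference flow and a zeroth-order
# gradient estimate. Everything below is a hypothetical stand-in: the real
# method uses a trained prompt classifier and the frozen LLM's own
# embedding layer, and learns the corruption offline.
import numpy as np

RNG = np.random.default_rng(0)
EMBED_DIM = 16
# In the paper this is learned offline via zeroth-order optimization;
# here it is just a fixed placeholder noise scale.
CORRUPTION_SCALE = 0.5


def should_forget(prompt: str) -> bool:
    """Hypothetical prompt classifier: flags prompts touching the forget set."""
    forget_keywords = {"forbidden topic"}  # placeholder forget domain
    return any(k in prompt.lower() for k in forget_keywords)


def embed(prompt: str) -> np.ndarray:
    """Toy embedding: one random vector per token (stand-in for the LLM's
    embedding layer; a real pipeline would use the model's tokenizer)."""
    tokens = prompt.split()
    return RNG.standard_normal((len(tokens), EMBED_DIM))


def corrupt(embeddings: np.ndarray) -> np.ndarray:
    """Add the (learned) corruption to the prompt embeddings; plain Gaussian
    noise here, scaled by the placeholder CORRUPTION_SCALE."""
    return embeddings + CORRUPTION_SCALE * RNG.standard_normal(embeddings.shape)


def eco_inference(prompt: str) -> np.ndarray:
    """Corrupt the prompt's embeddings only if the classifier flags it;
    the result would then be fed to the frozen LLM."""
    e = embed(prompt)
    if should_forget(prompt):
        e = corrupt(e)
    return e


def zo_grad(loss_fn, sigma: float, mu: float = 1e-2) -> float:
    """Two-point zeroth-order estimate of d(loss)/d(sigma), useful when the
    unlearning objective is not differentiated through the LLM. loss_fn is
    assumed to map a scalar corruption strength to a scalar loss."""
    u = RNG.standard_normal()
    return (loss_fn(sigma + mu * u) - loss_fn(sigma - mu * u)) / (2 * mu) * u
```

In the actual method the corruption parameters are optimized offline toward the unlearning objective while the underlying LLM stays frozen, so at inference time only the embeddings of flagged prompts are perturbed and all other prompts pass through unchanged.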