Large Language Model Unlearning via Embedding-Corrupted Prompts
June 12, 2024
Authors: Chris Yuhao Liu, Yaxuan Wang, Jeffrey Flanigan, Yang Liu
cs.AI
Abstract
Large language models (LLMs) have advanced to encompass extensive knowledge
across diverse domains. Yet controlling what a large language model should not
know is important for ensuring alignment and thus safe use. However, accurately
and efficiently unlearning knowledge from an LLM remains challenging due to the
potential collateral damage caused by the fuzzy boundary between retention and
forgetting, and the large computational requirements for optimization across
state-of-the-art models with hundreds of billions of parameters. In this work,
we present Embedding-COrrupted (ECO) Prompts, a lightweight unlearning
framework for large language models to address both the challenges of knowledge
entanglement and unlearning efficiency. Instead of relying on the LLM itself to
unlearn, we enforce an unlearned state during inference by employing a prompt
classifier to identify and safeguard prompts to forget. We learn corruptions
added to prompt embeddings via zeroth order optimization toward the unlearning
objective offline and corrupt prompts flagged by the classifier during
inference. We find that these embedding-corrupted prompts not only lead to
desirable outputs that satisfy the unlearning objective but also closely
approximate the output from a model that has never been trained on the data
intended for forgetting. Through extensive experiments on unlearning, we
demonstrate the superiority of our method in achieving promising unlearning
with nearly zero side effects in general domains and in domains closely related
to the
unlearned ones. Additionally, we highlight the scalability of our method to 100
LLMs, ranging from 0.5B to 236B parameters, incurring no additional cost as the
number of parameters increases.
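The two-stage pipeline the abstract describes — learning a corruption offline via zeroth-order optimization, then applying it at inference only to prompts flagged by a classifier — can be sketched as below. This is a minimal illustration, not the paper's implementation: `unlearning_loss` is a toy quadratic standing in for the black-box objective (which in ECO would query the frozen LLM), and the keyword set stands in for the trained prompt classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

def unlearning_loss(sigma: float) -> float:
    """Hypothetical black-box objective. In ECO this would score how closely
    the LLM's outputs on forget-set prompts, with noise of scale `sigma`
    added to their embeddings, match an unlearned model's outputs. Here a
    quadratic with its optimum at 1.2 stands in for that objective."""
    return (sigma - 1.2) ** 2

def zeroth_order_optimize(f, sigma0=0.0, lr=0.1, mu=1e-3, steps=200):
    """Two-point zeroth-order gradient estimate: needs only function
    evaluations, so no backpropagation through the LLM is required."""
    sigma = sigma0
    for _ in range(steps):
        g = (f(sigma + mu) - f(sigma - mu)) / (2 * mu)  # finite-difference slope
        sigma -= lr * g
    return sigma

# Offline phase: learn the corruption strength once.
sigma_star = zeroth_order_optimize(unlearning_loss)

# Stand-in for the prompt classifier (the paper trains a real classifier).
FORGET_KEYWORDS = {"forbidden topic", "secret recipe"}

def eco_inference(prompt: str, embedding: np.ndarray) -> np.ndarray:
    """If the classifier flags the prompt, corrupt its embedding with the
    learned noise scale before the (unchanged) LLM consumes it; retained
    prompts pass through untouched."""
    if any(k in prompt.lower() for k in FORGET_KEYWORDS):
        return embedding + sigma_star * rng.standard_normal(embedding.shape)
    return embedding
```

Because the unlearning objective is only ever evaluated, never differentiated, the cost of the offline phase does not grow with model size — consistent with the abstract's claim of no additional cost from 0.5B to 236B parameters.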