Large Language Model Unlearning via Embedding-Corrupted Prompts

Abstract

Large language models (LLMs) have advanced to encompass extensive knowledge across diverse domains. Controlling what a large language model should not know is nevertheless important for ensuring alignment and thus safe use. Accurately and efficiently unlearning knowledge from an LLM remains challenging for two reasons: the fuzzy boundary between retention and forgetting can cause collateral damage, and optimizing state-of-the-art models with hundreds of billions of parameters demands large amounts of computation. In this work, we present Embedding-COrrupted (ECO) Prompts, a lightweight unlearning framework for large language models that addresses both knowledge entanglement and unlearning efficiency. Instead of relying on the LLM itself to unlearn, we enforce an unlearned state during inference by employing a prompt classifier to identify and safeguard prompts to forget. Offline, we learn corruptions to add to prompt embeddings via zeroth-order optimization toward the unlearning objective; during inference, we corrupt the prompts flagged by the classifier. We find that these embedding-corrupted prompts not only lead to desirable outputs that satisfy the unlearning objective but also closely approximate the output of a model that has never been trained on the data intended for forgetting. Through extensive unlearning experiments, we demonstrate the superiority of our method in unlearning with nearly zero side effects, both in general domains and in domains closely related to the unlearned ones. Additionally, we highlight the scalability of our method to 100 LLMs, ranging from 0.5B to 236B parameters, with no additional cost as the number of parameters increases.
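The inference-time pipeline described in the abstract (flag a prompt, corrupt its embeddings, tune the corruption without gradients) can be illustrated with a small toy sketch. Everything here is an assumption made for illustration and not the authors' implementation: the keyword-based `flag_prompt` stands in for the prompt classifier, Gaussian noise with a single scale `sigma` stands in for the learned corruption, and a central finite-difference update stands in for zeroth-order optimization. The abstract does not specify any of these details, and all function names are hypothetical.

```python
import numpy as np


def flag_prompt(prompt, forget_keywords):
    """Hypothetical stand-in for the prompt classifier: a keyword match."""
    text = prompt.lower()
    return any(keyword in text for keyword in forget_keywords)


def corrupt(embeddings, sigma, rng):
    """Add Gaussian noise of scale sigma to every token embedding (assumed form)."""
    return embeddings + sigma * rng.standard_normal(embeddings.shape)


def guarded_embeddings(prompt, embeddings, sigma, forget_keywords, rng):
    """Corrupt embeddings only for flagged prompts; pass others through unchanged."""
    if flag_prompt(prompt, forget_keywords):
        return corrupt(embeddings, sigma, rng)
    return embeddings


def zeroth_order_sigma(loss_fn, sigma0=0.0, lr=0.5, delta=0.1, steps=50):
    """Tune sigma using only evaluations of loss_fn (no backpropagation).

    The slope is estimated with a central finite difference, one simple
    zeroth-order scheme. loss_fn is a hypothetical black box that scores
    how well a given corruption strength meets the unlearning objective.
    """
    sigma = sigma0
    for _ in range(steps):
        slope = (loss_fn(sigma + delta) - loss_fn(sigma - delta)) / (2 * delta)
        sigma -= lr * slope
    return sigma
```

In this sketch the model weights are never touched: the only learned quantity is the corruption strength, which is consistent with the abstract's claim that cost does not grow with parameter count. A caller would tune `sigma` offline with `zeroth_order_sigma` and then route every incoming prompt through `guarded_embeddings` before the forward pass.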
