UnUnlearning: Unlearning is not sufficient for content regulation in advanced generative AI
June 27, 2024
作者: Ilia Shumailov, Jamie Hayes, Eleni Triantafillou, Guillermo Ortiz-Jimenez, Nicolas Papernot, Matthew Jagielski, Itay Yona, Heidi Howard, Eugene Bagdasaryan
cs.AI
Abstract
Exact unlearning was first introduced as a privacy mechanism that allowed a
user to retract their data from machine learning models on request. Shortly
after, inexact schemes were proposed to mitigate the impractical costs
associated with exact unlearning. More recently, unlearning is often
discussed as an approach for the removal of impermissible knowledge, i.e.
knowledge that the model should not possess, such as unlicensed
copyrighted, inaccurate, or
malicious information. The promise is that if the model does not have a certain
malicious capability, then it cannot be used for the associated malicious
purpose. In this paper we revisit the paradigm in which unlearning is used
in Large Language Models (LLMs) and highlight an underlying inconsistency
arising from in-context learning. Unlearning can be an effective control
mechanism for the training phase, yet it does not prevent the model from
performing an impermissible act during inference. We introduce a concept of
ununlearning, where unlearned knowledge gets reintroduced in-context,
effectively rendering the model capable of behaving as if it knows the
forgotten knowledge. As a result, we argue that content filtering for
impermissible knowledge will be required and even exact unlearning schemes are
not enough for effective content regulation. We discuss the feasibility of
ununlearning for modern LLMs and examine broader implications.
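The mechanism the abstract describes, in which "forgotten" knowledge is restored simply by supplying it in the prompt, can be illustrated with a deliberately simplified toy sketch. This is only a conceptual stand-in, not the paper's method: `ToyModel`, its dictionary-backed "weights", and its context-parsing logic are all invented for illustration, and a real LLM's in-context learning is far richer.

```python
# Toy illustration of "ununlearning" (not a real LLM):
# a "model" that answers from its parametric knowledge plus whatever
# appears in its prompt context.
class ToyModel:
    def __init__(self, knowledge):
        # Dictionary stands in for knowledge stored in model weights.
        self.knowledge = dict(knowledge)

    def unlearn(self, key):
        # Exact unlearning: the fact is truly removed from the "weights".
        self.knowledge.pop(key, None)

    def answer(self, question, context=""):
        # In-context information takes precedence, mimicking
        # in-context learning at inference time.
        for line in context.splitlines():
            if ":" in line:
                key, value = line.split(":", 1)
                if key.strip() == question:
                    return value.strip()
        return self.knowledge.get(question, "I don't know")


model = ToyModel({"capital_of_france": "Paris"})
model.unlearn("capital_of_france")

# After exact unlearning, the model genuinely lacks the fact:
print(model.answer("capital_of_france"))  # I don't know

# "Ununlearning": the forgotten fact is reintroduced in-context,
# and the model behaves as if it still knows it:
print(model.answer("capital_of_france",
                   context="capital_of_france: Paris"))  # Paris
```

Even under exact unlearning (the fact is provably absent from the "weights"), the capability returns as soon as the knowledge appears in the context window, which is why the abstract argues that content filtering at inference time is still required.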