Localizing Paragraph Memorization in Language Models
March 28, 2024
Authors: Niklas Stoehr, Mitchell Gordon, Chiyuan Zhang, Owen Lewis
cs.AI
Abstract
Can we localize the weights and mechanisms used by a language model to
memorize and recite entire paragraphs of its training data? In this paper, we
show that while memorization is spread across multiple layers and model
components, gradients of memorized paragraphs have a distinguishable spatial
pattern, being larger in lower model layers than gradients of non-memorized
examples. Moreover, the memorized examples can be unlearned by fine-tuning only
the high-gradient weights. We localize a low-layer attention head that appears
to be especially involved in paragraph memorization. This head predominantly
focuses its attention on distinctive, rare tokens that are least frequent in a
corpus-level unigram distribution. Next, we study how localized memorization is
across the tokens in the prefix by perturbing tokens and measuring the resulting
change in the decoding. A few distinctive tokens early in a prefix can often
corrupt the entire continuation. Overall, memorized continuations are not only
harder to unlearn, but also harder to corrupt, than non-memorized ones.
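As a rough illustration of the gradient-based localization described in the abstract, the sketch below computes per-layer gradient norms of the standard causal-LM loss for two paragraphs; under the paper's observation, a memorized paragraph should show comparatively larger norms in the lower layers. This is a minimal sketch, not the authors' code: the model name (a small GPT-Neo checkpoint) and the two example strings are placeholder assumptions, and the paper's memorized paragraphs come from the training corpus rather than from toy text.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/gpt-neo-125m"  # assumption: any small GPT-style causal LM
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()  # dropout off; gradients still flow through backward()


def per_layer_grad_norm(text: str) -> dict:
    """L2 norm of the loss gradient, aggregated per transformer layer."""
    model.zero_grad()
    enc = tok(text, return_tensors="pt")
    # Standard causal-LM loss: predict every token from its prefix.
    loss = model(**enc, labels=enc["input_ids"]).loss
    loss.backward()
    sq_norms = {}
    for name, param in model.named_parameters():
        # GPT-Neo layer parameters are named transformer.h.<layer_idx>.<...>
        if param.grad is None or ".h." not in name:
            continue
        layer = int(name.split(".h.")[1].split(".")[0])
        sq_norms[layer] = sq_norms.get(layer, 0.0) + param.grad.pow(2).sum().item()
    return {layer: sq ** 0.5 for layer, sq in sorted(sq_norms.items())}


# Hypothetical stand-ins; in the paper these would be training-set paragraphs that
# the model does or does not reproduce verbatim when prompted with their prefix.
memorized_paragraph = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor."
ordinary_paragraph = "The weekly team meeting has been moved to Thursday afternoon for this month."

for label, text in [("memorized", memorized_paragraph), ("non-memorized", ordinary_paragraph)]:
    print(label, per_layer_grad_norm(text))
```

The per-layer norms could also be used to select the high-gradient weights that the abstract mentions fine-tuning for unlearning.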
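The prefix-perturbation experiment can likewise be approximated by swapping out one prefix token at a time and counting how many tokens of the greedy continuation change. Again a hedged sketch under the same assumptions (illustrative model, hypothetical prefix, and an arbitrary filler token), not the authors' implementation or metric.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/gpt-neo-125m"  # illustrative choice, as above
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()


@torch.no_grad()
def greedy_continuation(prefix_ids: torch.Tensor, n_new: int = 50) -> torch.Tensor:
    """Greedy-decode n_new tokens following the given prefix."""
    out = model.generate(
        prefix_ids,
        attention_mask=torch.ones_like(prefix_ids),
        max_new_tokens=n_new,
        do_sample=False,
        pad_token_id=tok.eos_token_id,
    )
    return out[0, prefix_ids.shape[1]:]


prefix = "Lorem ipsum dolor sit amet, consectetur adipiscing elit,"  # hypothetical prefix
prefix_ids = tok(prefix, return_tensors="pt")["input_ids"]
baseline = greedy_continuation(prefix_ids)

# Overwrite each prefix position with a bland filler token and measure how much
# of the continuation flips; distinctive early tokens tend to corrupt the most.
filler_id = tok(" the", add_special_tokens=False)["input_ids"][0]
for pos in range(prefix_ids.shape[1]):
    perturbed = prefix_ids.clone()
    perturbed[0, pos] = filler_id
    cont = greedy_continuation(perturbed)
    n = min(len(cont), len(baseline))
    changed = int((cont[:n] != baseline[:n]).sum()) + abs(len(cont) - len(baseline))
    print(f"perturbed position {pos}: {changed}/{len(baseline)} continuation tokens changed")
```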