Localizing Paragraph Memorization in Language Models
March 28, 2024
Authors: Niklas Stoehr, Mitchell Gordon, Chiyuan Zhang, Owen Lewis
cs.AI
Abstract
Can we localize the weights and mechanisms used by a language model to
memorize and recite entire paragraphs of its training data? In this paper, we
show that while memorization is spread across multiple layers and model
components, gradients of memorized paragraphs have a distinguishable spatial
pattern, being larger in lower model layers than gradients of non-memorized
examples. Moreover, the memorized examples can be unlearned by fine-tuning only
the high-gradient weights. We localize a low-layer attention head that appears
to be especially involved in paragraph memorization. This head predominantly
focuses its attention on distinctive, rare tokens that are least frequent in a
corpus-level unigram distribution. Next, we study how localized memorization is
across the tokens in the prefix by perturbing tokens and measuring the resulting
change in the decoding. A few distinctive tokens early in a prefix can often
corrupt the entire continuation. Overall, memorized continuations are not only
harder to unlearn, but also harder to corrupt, than non-memorized ones.
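As a rough illustration of the gradient-based localization described in the abstract, the sketch below computes per-layer gradient norms of the standard causal-LM loss for two paragraphs; under the paper's observation, a memorized paragraph should show comparatively larger norms in the lower layers. This is a minimal sketch, not the authors' code: the model name (a small GPT-Neo checkpoint) and the two example strings are placeholder assumptions, and the paper's memorized paragraphs come from the training corpus rather than from toy text.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/gpt-neo-125m"  # assumption: any small GPT-style causal LM
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()  # dropout off; gradients still flow through backward()


def per_layer_grad_norm(text: str) -> dict:
    """L2 norm of the loss gradient, aggregated per transformer layer."""
    model.zero_grad()
    enc = tok(text, return_tensors="pt")
    # Standard causal-LM loss: predict every token from its prefix.
    loss = model(**enc, labels=enc["input_ids"]).loss
    loss.backward()
    sq_norms = {}
    for name, param in model.named_parameters():
        # GPT-Neo layer parameters are named transformer.h.<layer_idx>.<...>
        if param.grad is None or ".h." not in name:
            continue
        layer = int(name.split(".h.")[1].split(".")[0])
        sq_norms[layer] = sq_norms.get(layer, 0.0) + param.grad.pow(2).sum().item()
    return {layer: sq ** 0.5 for layer, sq in sorted(sq_norms.items())}


# Hypothetical stand-ins; in the paper these would be training-set paragraphs that
# the model does or does not reproduce verbatim when prompted with their prefix.
memorized_paragraph = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor."
ordinary_paragraph = "The weekly team meeting has been moved to Thursday afternoon for this month."

for label, text in [("memorized", memorized_paragraph), ("non-memorized", ordinary_paragraph)]:
    print(label, per_layer_grad_norm(text))
```

The per-layer norms could also be used to select the high-gradient weights that the abstract mentions fine-tuning for unlearning.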
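The prefix-perturbation experiment can likewise be approximated by swapping out one prefix token at a time and counting how many tokens of the greedy continuation change. Again a hedged sketch under the same assumptions (illustrative model, hypothetical prefix, and an arbitrary filler token), not the authors' implementation or metric.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/gpt-neo-125m"  # illustrative choice, as above
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()


@torch.no_grad()
def greedy_continuation(prefix_ids: torch.Tensor, n_new: int = 50) -> torch.Tensor:
    """Greedy-decode n_new tokens following the given prefix."""
    out = model.generate(
        prefix_ids,
        attention_mask=torch.ones_like(prefix_ids),
        max_new_tokens=n_new,
        do_sample=False,
        pad_token_id=tok.eos_token_id,
    )
    return out[0, prefix_ids.shape[1]:]


prefix = "Lorem ipsum dolor sit amet, consectetur adipiscing elit,"  # hypothetical prefix
prefix_ids = tok(prefix, return_tensors="pt")["input_ids"]
baseline = greedy_continuation(prefix_ids)

# Overwrite each prefix position with a bland filler token and measure how much
# of the continuation flips; distinctive early tokens tend to corrupt the most.
filler_id = tok(" the", add_special_tokens=False)["input_ids"][0]
for pos in range(prefix_ids.shape[1]):
    perturbed = prefix_ids.clone()
    perturbed[0, pos] = filler_id
    cont = greedy_continuation(perturbed)
    n = min(len(cont), len(baseline))
    changed = int((cont[:n] != baseline[:n]).sum()) + abs(len(cont) - len(baseline))
    print(f"perturbed position {pos}: {changed}/{len(baseline)} continuation tokens changed")
```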