Localizing Paragraph Memorization in Language Models
March 28, 2024
Authors: Niklas Stoehr, Mitchell Gordon, Chiyuan Zhang, Owen Lewis
cs.AI
Abstract
Can we localize the weights and mechanisms used by a language model to
memorize and recite entire paragraphs of its training data? In this paper, we
show that while memorization is spread across multiple layers and model
components, gradients of memorized paragraphs have a distinguishable spatial
pattern, being larger in lower model layers than gradients of non-memorized
examples. Moreover, the memorized examples can be unlearned by fine-tuning only
the high-gradient weights. We localize a low-layer attention head that appears
to be especially involved in paragraph memorization. This head is predominantly
focusing its attention on distinctive, rare tokens that are least frequent in a
corpus-level unigram distribution. Next, we study how localized memorization is
across the tokens in the prefix by perturbing tokens and measuring the resulting
change in the decoding. A few distinctive tokens early in a prefix can often
corrupt the entire continuation. Overall, memorized continuations are not only
harder to unlearn, but also harder to corrupt, than non-memorized ones.
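
To make the gradient-based localization and unlearning procedure concrete, the following is a minimal sketch assuming a small causal language model loaded through Hugging Face Transformers. The model name, the placeholder paragraphs, the number of high-gradient weight tensors kept trainable, and the gradient-ascent unlearning objective are illustrative assumptions, not details taken from the abstract.

```python
# Sketch: per-parameter gradient norms for one paragraph, then "unlearning"
# restricted to the highest-gradient weight tensors.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-125m"  # illustrative choice of model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

memorized_paragraph = "..."      # placeholder: a paragraph the model recites verbatim
non_memorized_paragraph = "..."  # placeholder: a paragraph it does not recite

def param_gradient_norms(text: str) -> dict:
    """Gradient norm of the LM loss on `text`, per named parameter tensor."""
    model.zero_grad()
    ids = tok(text, return_tensors="pt").input_ids
    model(input_ids=ids, labels=ids).loss.backward()
    return {name: p.grad.norm().item()
            for name, p in model.named_parameters() if p.grad is not None}

# Comparing these two dictionaries layer by layer surfaces the spatial pattern
# reported in the paper: larger lower-layer gradients for memorized paragraphs.
mem_norms = param_gradient_norms(memorized_paragraph)
non_norms = param_gradient_norms(non_memorized_paragraph)

# Keep only the highest-gradient tensors trainable and push the loss on the
# memorized paragraph up (gradient ascent) -- one possible unlearning objective.
top = set(sorted(mem_norms, key=mem_norms.get, reverse=True)[:10])
for name, p in model.named_parameters():
    p.requires_grad_(name in top)
opt = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=1e-5)

for _ in range(10):
    ids = tok(memorized_paragraph, return_tensors="pt").input_ids
    loss = model(input_ids=ids, labels=ids).loss
    (-loss).backward()  # ascend the loss so the paragraph is no longer recited
    opt.step()
    opt.zero_grad()
```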
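The prefix-perturbation probe described in the abstract can be sketched in the same setting: swap one prefix token at a time and measure how much of the greedy continuation changes. The substitute token (here the end-of-sequence token), the continuation length, and the placeholder prefix are arbitrary choices for illustration and may differ from the paper's setup.

```python
# Sketch: perturb one prefix token at a time and measure how much of the
# greedy continuation flips relative to the unperturbed decoding.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-125m"  # illustrative choice of model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def continuation(prefix_ids: torch.Tensor, n_new: int = 50) -> torch.Tensor:
    """Greedy-decoded continuation tokens for a given prefix."""
    out = model.generate(prefix_ids, max_new_tokens=n_new, do_sample=False,
                         pad_token_id=tok.eos_token_id)
    return out[0, prefix_ids.shape[1]:]

def frac_changed(a: torch.Tensor, b: torch.Tensor) -> float:
    """Fraction of positions (up to the shorter length) where two decodings differ."""
    n = min(len(a), len(b))
    return (a[:n] != b[:n]).float().mean().item()

prefix = "..."  # placeholder: prefix of a (non-)memorized paragraph
prefix_ids = tok(prefix, return_tensors="pt").input_ids
baseline = continuation(prefix_ids)

for pos in range(prefix_ids.shape[1]):
    perturbed = prefix_ids.clone()
    perturbed[0, pos] = tok.eos_token_id  # crude single-token perturbation
    print(f"position {pos}: "
          f"{frac_changed(continuation(perturbed), baseline):.0%} of continuation changed")
```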