Localizing Paragraph Memorization in Language Models
March 28, 2024
Authors: Niklas Stoehr, Mitchell Gordon, Chiyuan Zhang, Owen Lewis
cs.AI
Abstract
Can we localize the weights and mechanisms used by a language model to
memorize and recite entire paragraphs of its training data? In this paper, we
show that while memorization is spread across multiple layers and model
components, gradients of memorized paragraphs have a distinguishable spatial
pattern, being larger in lower model layers than gradients of non-memorized
examples. Moreover, the memorized examples can be unlearned by fine-tuning only
the high-gradient weights. We localize a low-layer attention head that appears
to be especially involved in paragraph memorization. This head is predominantly
focusing its attention on distinctive, rare tokens that are least frequent in a
corpus-level unigram distribution. Next, we study how localized memorization is
across the tokens in the prefix by perturbing tokens and measuring the resulting
change in the decoding. A few distinctive tokens early in a prefix can often
corrupt the entire continuation. Overall, memorized continuations are not only
harder to unlearn, but also harder to corrupt, than non-memorized ones.
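
To make the gradient-based localization and unlearning procedure concrete, the following is a minimal sketch assuming a small causal language model loaded through Hugging Face Transformers. The model name, the placeholder paragraphs, the number of high-gradient weight tensors kept trainable, and the gradient-ascent unlearning objective are illustrative assumptions, not details taken from the abstract.

```python
# Sketch: per-parameter gradient norms for one paragraph, then "unlearning"
# restricted to the highest-gradient weight tensors.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-125m"  # illustrative choice of model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

memorized_paragraph = "..."      # placeholder: a paragraph the model recites verbatim
non_memorized_paragraph = "..."  # placeholder: a paragraph it does not recite

def param_gradient_norms(text: str) -> dict:
    """Gradient norm of the LM loss on `text`, per named parameter tensor."""
    model.zero_grad()
    ids = tok(text, return_tensors="pt").input_ids
    model(input_ids=ids, labels=ids).loss.backward()
    return {name: p.grad.norm().item()
            for name, p in model.named_parameters() if p.grad is not None}

# Comparing these two dictionaries layer by layer surfaces the spatial pattern
# reported in the paper: larger lower-layer gradients for memorized paragraphs.
mem_norms = param_gradient_norms(memorized_paragraph)
non_norms = param_gradient_norms(non_memorized_paragraph)

# Keep only the highest-gradient tensors trainable and push the loss on the
# memorized paragraph up (gradient ascent) -- one possible unlearning objective.
top = set(sorted(mem_norms, key=mem_norms.get, reverse=True)[:10])
for name, p in model.named_parameters():
    p.requires_grad_(name in top)
opt = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=1e-5)

for _ in range(10):
    ids = tok(memorized_paragraph, return_tensors="pt").input_ids
    loss = model(input_ids=ids, labels=ids).loss
    (-loss).backward()  # ascend the loss so the paragraph is no longer recited
    opt.step()
    opt.zero_grad()
```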
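The prefix-perturbation probe described in the abstract can be sketched in the same setting: swap one prefix token at a time and measure how much of the greedy continuation changes. The substitute token (here the end-of-sequence token), the continuation length, and the placeholder prefix are arbitrary choices for illustration and may differ from the paper's setup.

```python
# Sketch: perturb one prefix token at a time and measure how much of the
# greedy continuation flips relative to the unperturbed decoding.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-125m"  # illustrative choice of model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def continuation(prefix_ids: torch.Tensor, n_new: int = 50) -> torch.Tensor:
    """Greedy-decoded continuation tokens for a given prefix."""
    out = model.generate(prefix_ids, max_new_tokens=n_new, do_sample=False,
                         pad_token_id=tok.eos_token_id)
    return out[0, prefix_ids.shape[1]:]

def frac_changed(a: torch.Tensor, b: torch.Tensor) -> float:
    """Fraction of positions (up to the shorter length) where two decodings differ."""
    n = min(len(a), len(b))
    return (a[:n] != b[:n]).float().mean().item()

prefix = "..."  # placeholder: prefix of a (non-)memorized paragraph
prefix_ids = tok(prefix, return_tensors="pt").input_ids
baseline = continuation(prefix_ids)

for pos in range(prefix_ids.shape[1]):
    perturbed = prefix_ids.clone()
    perturbed[0, pos] = tok.eos_token_id  # crude single-token perturbation
    print(f"position {pos}: "
          f"{frac_changed(continuation(perturbed), baseline):.0%} of continuation changed")
```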