Birth of a Transformer: A Memory Viewpoint
June 1, 2023
Authors: Alberto Bietti, Vivien Cabannes, Diane Bouchacourt, Herve Jegou, Leon Bottou
cs.AI
Abstract
Large language models based on transformers have achieved great empirical
successes. However, as they are deployed more widely, there is a growing need
to better understand their internal mechanisms in order to make them more
reliable. These models appear to store vast amounts of knowledge from their
training data, and to adapt quickly to new information provided in their
context or prompt. We study how transformers balance these two types of
knowledge by considering a synthetic setup where tokens are generated from
either global or context-specific bigram distributions. By a careful empirical
analysis of the training process on a simplified two-layer transformer, we
illustrate the fast learning of global bigrams and the slower development of an
"induction head" mechanism for the in-context bigrams. We highlight the role of
weight matrices as associative memories, provide theoretical insights on how
gradients enable their learning during training, and study the role of
data-distributional properties.
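
To make the synthetic setup concrete, here is a minimal NumPy sketch of how sequences mixing global and context-specific bigrams could be generated. The vocabulary size, sequence length, number of trigger tokens, and the sample_sequence helper are all assumptions chosen for illustration, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)

V = 64          # vocabulary size (hypothetical)
T = 256         # sequence length (hypothetical)
n_triggers = 4  # tokens whose successor is context-specific (hypothetical)

# Global bigram model: a fixed V x V Markov transition matrix shared by all sequences.
global_bigram = rng.dirichlet(np.ones(V), size=V)

# A few "trigger" tokens are followed by a successor drawn once per sequence,
# so predicting them requires copying from earlier context (an induction-head-like
# mechanism) rather than memorizing global statistics.
triggers = rng.choice(V, size=n_triggers, replace=False)

def sample_sequence():
    # Per-sequence (context-specific) successor for each trigger token.
    ctx_successor = {q: rng.integers(V) for q in triggers}
    seq = [rng.integers(V)]
    for _ in range(T - 1):
        prev = seq[-1]
        if prev in ctx_successor:
            seq.append(ctx_successor[prev])                     # in-context bigram
        else:
            seq.append(rng.choice(V, p=global_bigram[prev]))    # global bigram
    return np.array(seq)

print(sample_sequence()[:20])
```

A model trained on such data can predict the successors of ordinary tokens from global statistics alone, but must read its own context to predict the successor of a trigger token.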
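
The view of weight matrices as associative memories can be illustrated with a sum of outer products over nearly orthonormal embeddings: querying the matrix with an input ("key") embedding approximately retrieves the associated output ("value") embedding. The dimensions and variable names below are hypothetical; this is a sketch of the general mechanism, not the paper's trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n_pairs = 256, 10  # embedding dimension and number of stored associations (hypothetical)

# Random high-dimensional embeddings are nearly orthonormal, which is what
# makes the outer-product construction behave like a key-value memory.
U = rng.standard_normal((n_pairs, d)) / np.sqrt(d)     # input ("key") embeddings u_i
Vout = rng.standard_normal((n_pairs, d)) / np.sqrt(d)  # output ("value") embeddings v_i

# Associative memory: W = sum_i v_i u_i^T
W = Vout.T @ U  # shape (d, d)

# Querying with u_j returns v_j * ||u_j||^2 plus small cross terms,
# so the stored association is approximately recovered.
j = 3
retrieved = W @ U[j]
scores = Vout @ retrieved  # compare the retrieved vector against all stored values
print("retrieved index:", scores.argmax(), "expected:", j)
```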