Birth of a Transformer: A Memory Viewpoint
June 1, 2023
Authors: Alberto Bietti, Vivien Cabannes, Diane Bouchacourt, Herve Jegou, Leon Bottou
cs.AI
Abstract
Large language models based on transformers have achieved great empirical
successes. However, as they are deployed more widely, there is a growing need
to better understand their internal mechanisms in order to make them more
reliable. These models appear to store vast amounts of knowledge from their
training data, and to adapt quickly to new information provided in their
context or prompt. We study how transformers balance these two types of
knowledge by considering a synthetic setup where tokens are generated from
either global or context-specific bigram distributions. By a careful empirical
analysis of the training process on a simplified two-layer transformer, we
illustrate the fast learning of global bigrams and the slower development of an
"induction head" mechanism for the in-context bigrams. We highlight the role of
weight matrices as associative memories, provide theoretical insights on how
gradients enable their learning during training, and study the role of
data-distributional properties.
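
To make the synthetic setup concrete, here is a minimal NumPy sketch of how sequences mixing global and context-specific bigrams could be generated. The vocabulary size, sequence length, number of trigger tokens, and the sample_sequence helper are all assumptions chosen for illustration, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)

V = 64          # vocabulary size (hypothetical)
T = 256         # sequence length (hypothetical)
n_triggers = 4  # tokens whose successor is context-specific (hypothetical)

# Global bigram model: a fixed V x V Markov transition matrix shared by all sequences.
global_bigram = rng.dirichlet(np.ones(V), size=V)

# A few "trigger" tokens are followed by a successor drawn once per sequence,
# so predicting them requires copying from earlier context (an induction-head-like
# mechanism) rather than memorizing global statistics.
triggers = rng.choice(V, size=n_triggers, replace=False)

def sample_sequence():
    # Per-sequence (context-specific) successor for each trigger token.
    ctx_successor = {q: rng.integers(V) for q in triggers}
    seq = [rng.integers(V)]
    for _ in range(T - 1):
        prev = seq[-1]
        if prev in ctx_successor:
            seq.append(ctx_successor[prev])                     # in-context bigram
        else:
            seq.append(rng.choice(V, p=global_bigram[prev]))    # global bigram
    return np.array(seq)

print(sample_sequence()[:20])
```

A model trained on such data can predict the successors of ordinary tokens from global statistics alone, but must read its own context to predict the successor of a trigger token.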
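
The view of weight matrices as associative memories can be illustrated with a sum of outer products over nearly orthonormal embeddings: querying the matrix with an input ("key") embedding approximately retrieves the associated output ("value") embedding. The dimensions and variable names below are hypothetical; this is a sketch of the general mechanism, not the paper's trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n_pairs = 256, 10  # embedding dimension and number of stored associations (hypothetical)

# Random high-dimensional embeddings are nearly orthonormal, which is what
# makes the outer-product construction behave like a key-value memory.
U = rng.standard_normal((n_pairs, d)) / np.sqrt(d)     # input ("key") embeddings u_i
Vout = rng.standard_normal((n_pairs, d)) / np.sqrt(d)  # output ("value") embeddings v_i

# Associative memory: W = sum_i v_i u_i^T
W = Vout.T @ U  # shape (d, d)

# Querying with u_j returns v_j * ||u_j||^2 plus small cross terms,
# so the stored association is approximately recovered.
j = 3
retrieved = W @ U[j]
scores = Vout @ retrieved  # compare the retrieved vector against all stored values
print("retrieved index:", scores.argmax(), "expected:", j)
```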