
Birth of a Transformer: A Memory Viewpoint

June 1, 2023
Authors: Alberto Bietti, Vivien Cabannes, Diane Bouchacourt, Herve Jegou, Leon Bottou
cs.AI

Abstract

Large language models based on transformers have achieved great empirical successes. However, as they are deployed more widely, there is a growing need to better understand their internal mechanisms in order to make them more reliable. These models appear to store vast amounts of knowledge from their training data, and to adapt quickly to new information provided in their context or prompt. We study how transformers balance these two types of knowledge by considering a synthetic setup where tokens are generated from either global or context-specific bigram distributions. By a careful empirical analysis of the training process on a simplified two-layer transformer, we illustrate the fast learning of global bigrams and the slower development of an "induction head" mechanism for the in-context bigrams. We highlight the role of weight matrices as associative memories, provide theoretical insights on how gradients enable their learning during training, and study the role of data-distributional properties.
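The synthetic setup can be pictured with a minimal Python sketch of the data-generating process. This is an illustration only, not the paper's exact specification: the vocabulary size, sequence length, trigger tokens, and the Dirichlet-sampled global transition matrix are all assumed values chosen for the example.

import numpy as np

rng = np.random.default_rng(0)
V = 32  # assumed vocabulary size

# Fixed and shared across all sequences: stands in for the global bigram statistics.
global_bigram = rng.dirichlet(np.ones(V), size=V)  # (V, V), each row sums to 1

def sample_sequence(seq_len=64, trigger_tokens=(0, 1)):
    # Context-specific bigrams: each trigger token gets a continuation that is
    # resampled for every sequence, so it can only be predicted from the context.
    context_next = {t: int(rng.integers(V)) for t in trigger_tokens}
    seq = [int(rng.integers(V))]
    for _ in range(seq_len - 1):
        prev = seq[-1]
        if prev in context_next:
            seq.append(context_next[prev])                         # in-context bigram
        else:
            seq.append(int(rng.choice(V, p=global_bigram[prev])))  # global bigram
    return seq

print(sample_sequence())

Under a setup of this kind, a model can store the global bigram statistics directly in its weights, but predicting the continuation of a trigger token requires copying it from earlier in the sequence, which is the role the paper attributes to the induction head.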