Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory
May 14, 2024
Authors: Xueyan Niu, Bo Bai, Lei Deng, Wei Han
cs.AI
Abstract
Increasing the size of a Transformer model does not always lead to enhanced
performance. This phenomenon cannot be explained by the empirical scaling laws.
Furthermore, generalization ability improves as the model memorizes the
training samples. We present a theoretical framework that sheds light on the
memorization process and performance dynamics of transformer-based language
models. We model the behavior of Transformers with associative memories using
Hopfield networks, such that each transformer block effectively conducts an
approximate nearest-neighbor search. Based on this, we design an energy
function analogous to that in the modern continuous Hopfield network, which
provides an insightful explanation for the attention mechanism. Using the
majorization-minimization technique, we construct a global energy function that
captures the layered architecture of the Transformer. Under specific
conditions, we show that the minimum achievable cross-entropy loss is bounded
from below by a constant approximately equal to 1. We substantiate our
theoretical results by conducting experiments with GPT-2 on various data sizes,
as well as training vanilla Transformers on a dataset of 2M tokens.
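As background for the Hopfield-network connection mentioned in the abstract, a minimal sketch of the standard modern continuous Hopfield energy (the log-sum-exp formulation of Ramsauer et al., 2020) and the retrieval step it induces is given below. This is illustrative only and is not the paper's global, layered energy; the symbols X, \xi, \beta, and d_k are notation assumed here for the sketch.

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Modern continuous Hopfield energy; the abstract's per-block energy is
% described as analogous to this form. Notation (assumed for this sketch):
% X = (x_1, ..., x_N) are the stored patterns (keys), \xi is the query state,
% and \beta > 0 is an inverse temperature.
\[
E(\xi) \;=\; -\frac{1}{\beta}\log\sum_{i=1}^{N}\exp\bigl(\beta\,x_i^{\top}\xi\bigr)
\;+\; \frac{1}{2}\,\xi^{\top}\xi \;+\; \mathrm{const}.
\]
% One minimization (CCCP) step of E yields the retrieval rule
\[
\xi^{\mathrm{new}} \;=\; X\,\operatorname{softmax}\bigl(\beta\,X^{\top}\xi\bigr),
\]
% which, with \beta = 1/\sqrt{d_k} and learned query/key/value projections,
% matches the familiar attention update
% \operatorname{softmax}(QK^{\top}/\sqrt{d_k})\,V.
\end{document}
```

Read this way, the abstract's statement that each Transformer block effectively conducts an approximate nearest-neighbor search can be understood as the retrieval step moving the query toward the stored pattern it most closely matches; the paper's contribution, per the abstract, is a global energy over the layered architecture obtained via majorization-minimization, which this sketch does not reproduce.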