Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory
May 14, 2024
Authors: Xueyan Niu, Bo Bai, Lei Deng, Wei Han
cs.AI
Abstract
Increasing the size of a Transformer model does not always lead to enhanced
performance. This phenomenon cannot be explained by the empirical scaling laws.
Furthermore, generalization ability improves as the model memorizes the
training samples. We present a theoretical framework that sheds light on the
memorization process and performance dynamics of transformer-based language
models. We model the behavior of Transformers with associative memories using
Hopfield networks, such that each transformer block effectively conducts an
approximate nearest-neighbor search. Based on this, we design an energy
function analogous to that in the modern continuous Hopfield network, which
provides an insightful explanation for the attention mechanism. Using the
majorization-minimization technique, we construct a global energy function that
captures the layered architecture of the Transformer. Under specific
conditions, we show that the minimum achievable cross-entropy loss is bounded
from below by a constant approximately equal to 1. We substantiate our
theoretical results by conducting experiments with GPT-2 on various data sizes,
as well as training vanilla Transformers on a dataset of 2M tokens.
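As background for the Hopfield-network connection mentioned in the abstract, a minimal sketch of the standard modern continuous Hopfield energy (the log-sum-exp formulation of Ramsauer et al., 2020) and the retrieval step it induces is given below. This is illustrative only and is not the paper's global, layered energy; the symbols X, \xi, \beta, and d_k are notation assumed here for the sketch.

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Modern continuous Hopfield energy; the abstract's per-block energy is
% described as analogous to this form. Notation (assumed for this sketch):
% X = (x_1, ..., x_N) are the stored patterns (keys), \xi is the query state,
% and \beta > 0 is an inverse temperature.
\[
E(\xi) \;=\; -\frac{1}{\beta}\log\sum_{i=1}^{N}\exp\bigl(\beta\,x_i^{\top}\xi\bigr)
\;+\; \frac{1}{2}\,\xi^{\top}\xi \;+\; \mathrm{const}.
\]
% One minimization (CCCP) step of E yields the retrieval rule
\[
\xi^{\mathrm{new}} \;=\; X\,\operatorname{softmax}\bigl(\beta\,X^{\top}\xi\bigr),
\]
% which, with \beta = 1/\sqrt{d_k} and learned query/key/value projections,
% matches the familiar attention update
% \operatorname{softmax}(QK^{\top}/\sqrt{d_k})\,V.
\end{document}
```

Read this way, the abstract's statement that each Transformer block effectively conducts an approximate nearest-neighbor search can be understood as the retrieval step moving the query toward the stored pattern it most closely matches; the paper's contribution, per the abstract, is a global energy over the layered architecture obtained via majorization-minimization, which this sketch does not reproduce.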