Limitations of Normalization in Attention Mechanism
August 25, 2025
Authors: Timur Mudarisov, Mikhail Burtsev, Tatiana Petrova, Radu State
cs.AI
Abstract
This paper investigates the limitations of normalization in attention
mechanisms. We begin with a theoretical framework that enables the
identification of the model's selective ability and the geometric separation
involved in token selection. Our analysis includes explicit bounds on distances
and separation criteria for token vectors under softmax scaling. Through
experiments with a pre-trained GPT-2 model, we empirically validate our
theoretical results and analyze key behaviors of the attention mechanism.
Notably, we demonstrate that as the number of selected tokens increases, the
model's ability to distinguish informative tokens declines, often converging
toward a uniform selection pattern. We also show that gradient sensitivity
under softmax normalization presents challenges during training, especially at
low-temperature settings. These findings advance the current understanding of
softmax-based attention mechanisms and motivate the need for more robust
normalization and selection strategies in future attention architectures.
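
To make the two headline effects concrete, here is a minimal, self-contained NumPy sketch (a toy illustration, not the paper's actual bounds or experimental code): it shows how a token with a fixed logit advantage is diluted toward the uniform weight 1/n as the number of competing tokens grows, and how the softmax Jacobian at low temperature either vanishes (well-separated logits saturate) or blows up like 1/T (near-tied logits).

```python
import numpy as np

def softmax(z, temperature=1.0):
    """Temperature-scaled softmax; lower temperature sharpens the distribution."""
    z = np.asarray(z, dtype=float) / temperature
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def softmax_jacobian(z, temperature=1.0):
    """Jacobian of softmax(z / T) w.r.t. the raw logits z:
    J = (diag(p) - p p^T) / T, so every entry carries a 1/T factor."""
    p = softmax(z, temperature)
    return (np.diag(p) - np.outer(p, p)) / temperature

# 1) Dilution toward uniformity: one token keeps a fixed logit advantage,
#    yet its attention weight shrinks toward the uniform weight 1/n as the
#    number of competing tokens n grows.
for n in (8, 64, 512):
    logits = np.zeros(n)
    logits[0] = 2.0                      # fixed advantage for the "informative" token
    p = softmax(logits)
    print(f"n={n:4d}  top weight={p[0]:.4f}  uniform={1/n:.4f}")

# 2) Gradient sensitivity at low temperature: with well-separated logits the
#    softmax saturates and the Jacobian collapses toward zero; with near-tied
#    logits the 1/T factor makes it blow up instead.
separated = np.array([2.0, 0.0, -1.0, -1.5])
near_tie = np.array([1.001, 1.000, -1.0, -1.5])
for T in (1.0, 0.1, 0.01):
    j_sep = np.abs(softmax_jacobian(separated, temperature=T)).max()
    j_tie = np.abs(softmax_jacobian(near_tie, temperature=T)).max()
    print(f"T={T:5.2f}  separated max|J|={j_sep:.2e}  near-tie max|J|={j_tie:.2e}")
```

Both toy regimes mirror the abstract's claims: the selection pattern flattens as the pool of tokens grows, and at low temperature the gradient signal is either starved by saturation or amplified near ties, depending on the logit gap.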