Limitations of Normalization in Attention Mechanism
August 25, 2025
Authors: Timur Mudarisov, Mikhail Burtsev, Tatiana Petrova, Radu State
cs.AI
Abstract
This paper investigates the limitations of normalization in attention
mechanisms. We begin with a theoretical framework that enables the
identification of the model's selective ability and the geometric separation
involved in token selection. Our analysis includes explicit bounds on distances
and separation criteria for token vectors under softmax scaling. Through
experiments with a pre-trained GPT-2 model, we empirically validate our
theoretical results and analyze key behaviors of the attention mechanism.
Notably, we demonstrate that as the number of selected tokens increases, the
model's ability to distinguish informative tokens declines, often converging
toward a uniform selection pattern. We also show that gradient sensitivity
under softmax normalization presents challenges during training, especially at
low-temperature settings. These findings advance the current understanding of
softmax-based attention mechanisms and motivate the need for more robust
normalization and selection strategies in future attention architectures.
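
To make the two headline effects concrete, here is a minimal, self-contained NumPy sketch (a toy illustration, not the paper's actual bounds or experimental code): it shows how a token with a fixed logit advantage is diluted toward the uniform weight 1/n as the number of competing tokens grows, and how the softmax Jacobian at low temperature either vanishes (well-separated logits saturate) or blows up like 1/T (near-tied logits).

```python
import numpy as np

def softmax(z, temperature=1.0):
    """Temperature-scaled softmax; lower temperature sharpens the distribution."""
    z = np.asarray(z, dtype=float) / temperature
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def softmax_jacobian(z, temperature=1.0):
    """Jacobian of softmax(z / T) w.r.t. the raw logits z:
    J = (diag(p) - p p^T) / T, so every entry carries a 1/T factor."""
    p = softmax(z, temperature)
    return (np.diag(p) - np.outer(p, p)) / temperature

# 1) Dilution toward uniformity: one token keeps a fixed logit advantage,
#    yet its attention weight shrinks toward the uniform weight 1/n as the
#    number of competing tokens n grows.
for n in (8, 64, 512):
    logits = np.zeros(n)
    logits[0] = 2.0                      # fixed advantage for the "informative" token
    p = softmax(logits)
    print(f"n={n:4d}  top weight={p[0]:.4f}  uniform={1/n:.4f}")

# 2) Gradient sensitivity at low temperature: with well-separated logits the
#    softmax saturates and the Jacobian collapses toward zero; with near-tied
#    logits the 1/T factor makes it blow up instead.
separated = np.array([2.0, 0.0, -1.0, -1.5])
near_tie = np.array([1.001, 1.000, -1.0, -1.5])
for T in (1.0, 0.1, 0.01):
    j_sep = np.abs(softmax_jacobian(separated, temperature=T)).max()
    j_tie = np.abs(softmax_jacobian(near_tie, temperature=T)).max()
    print(f"T={T:5.2f}  separated max|J|={j_sep:.2e}  near-tie max|J|={j_tie:.2e}")
```

Both toy regimes mirror the abstract's claims: the selection pattern flattens as the pool of tokens grows, and at low temperature the gradient signal is either starved by saturation or amplified near ties, depending on the logit gap.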