
Limitations of Normalization in Attention Mechanism

August 25, 2025
作者: Timur Mudarisov, Mikhail Burtsev, Tatiana Petrova, Radu State
cs.AI

Abstract

This paper investigates the limitations of normalization in attention mechanisms. We begin by establishing a theoretical framework that characterizes the model's selective ability and the geometric separation involved in token selection. Our analysis includes explicit bounds on distances and separation criteria for token vectors under softmax scaling. Through experiments with a pre-trained GPT-2 model, we empirically validate the theoretical results and analyze key behaviors of the attention mechanism. Notably, we demonstrate that as the number of selected tokens increases, the model's ability to distinguish informative tokens declines, often converging toward a uniform selection pattern. We also show that gradient sensitivity under softmax normalization poses challenges during training, especially in low-temperature settings. These findings advance the current understanding of softmax-based attention mechanisms and motivate more robust normalization and selection strategies in future attention architectures.
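The two phenomena the abstract highlights can be sketched numerically. The following is a minimal illustration (not the paper's code or bounds): first, with bounded scores, the softmax attention distribution drifts toward uniform as the number of tokens grows, so its entropy approaches log n; second, the softmax Jacobian, diag(p) − p pᵀ scaled by 1/τ, collapses toward zero off the winning token at low temperature τ. The score ranges and temperature values here are illustrative assumptions.

```python
import numpy as np

def softmax(z, tau=1.0):
    # Temperature-scaled softmax; lower tau sharpens the distribution.
    z = z / tau
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Effect 1: with bounded scores, attention tends toward a uniform
# selection pattern as n grows (entropy approaches log n).
rng = np.random.default_rng(0)
for n in (8, 64, 512):
    scores = rng.uniform(-1.0, 1.0, size=n)  # bounded similarity scores (assumed)
    p = softmax(scores)
    entropy = -(p * np.log(p)).sum()
    print(f"n={n:4d}  entropy={entropy:.3f}  log(n)={np.log(n):.3f}")

# Effect 2: gradient sensitivity under the temperature. The Jacobian of
# softmax w.r.t. the scores is (1/tau) * (diag(p) - p p^T); at low tau the
# distribution saturates and gradient entries shrink toward zero.
scores = np.array([2.0, 1.0, 0.5, 0.0])
for tau in (1.0, 0.1):
    p = softmax(scores, tau)
    jac = (np.diag(p) - np.outer(p, p)) / tau
    print(f"tau={tau:4.1f}  max |dp/dz| = {np.abs(jac).max():.6f}")
```

At τ = 1 the largest Jacobian entry is on the order of 0.2, while at τ = 0.1 the distribution is nearly one-hot and the entries are orders of magnitude smaller, which is one way the training difficulty in low-temperature settings can manifest.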