
Scalable-Softmax Is Superior for Attention

January 31, 2025
Author: Ken M. Nakanishi
cs.AI

Abstract

The maximum element of the vector output by the Softmax function approaches zero as the input vector size increases. Transformer-based language models rely on Softmax to compute attention scores, causing the attention distribution to flatten as the context size grows. This reduces the model's ability to prioritize key information effectively and potentially limits its length generalization. To address this problem, we propose Scalable-Softmax (SSMax), which replaces Softmax in scenarios where the input vector size varies. SSMax can be seamlessly integrated into existing Transformer-based architectures. Experimental results in language modeling show that models using SSMax not only achieve faster loss reduction during pretraining but also significantly improve performance in long contexts and key information retrieval. Furthermore, an analysis of attention scores reveals that SSMax enables the model to focus attention on key information even in long contexts. Additionally, although models that use SSMax from the beginning of pretraining achieve better length generalization, those that have already started pretraining can still gain some of this ability by replacing Softmax in the attention layers with SSMax, either during or after pretraining.
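To make the flattening effect concrete, below is a minimal NumPy sketch contrasting standard Softmax with a scalable variant. It assumes SSMax rescales the attention logits by a factor proportional to the logarithm of the input length n, controlled by a scalar s (learnable in practice); this exact parameterization is an illustrative assumption, not necessarily the paper's precise definition.

    import numpy as np

    def softmax(z):
        # Standard Softmax; the peak value shrinks toward zero as len(z) grows.
        e = np.exp(z - z.max())
        return e / e.sum()

    def ssmax(z, s=0.5):
        # Hypothetical SSMax sketch: scale logits by s * log(n), where n is the
        # input length, before normalizing. The scalar s and this exact form
        # are assumptions for illustration.
        n = len(z)
        e = np.exp(s * np.log(n) * (z - z.max()))
        return e / e.sum()

    # One "key" position has a higher logit than the rest. As n grows, the
    # Softmax peak decays toward zero while the size-aware variant stays sharp.
    for n in (16, 256, 4096):
        z = np.zeros(n)
        z[0] = 5.0
        print(f"n={n:5d}  softmax peak={softmax(z)[0]:.3f}  ssmax peak={ssmax(z)[0]:.3f}")

With these settings, the Softmax peak drops from roughly 0.9 at n=16 to about 0.03 at n=4096, whereas the size-aware scaling keeps the peak close to 1, mirroring the attention-flattening behavior the abstract describes.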
