
On the Expressiveness of Softmax Attention: A Recurrent Neural Network Perspective

July 31, 2025
作者: Gabriel Mongaras, Eric C. Larson
cs.AI

Abstract

Since its introduction, softmax attention has become the backbone of modern transformer architectures due to its expressiveness and scalability across a wide range of tasks. However, the main drawback of softmax attention is its quadratic memory requirement and computational complexity with respect to the sequence length. By replacing the softmax nonlinearity, linear attention and similar methods have been introduced to avoid this quadratic bottleneck. Although these linear forms of attention are derived from the original softmax formulation, they typically lag behind in downstream accuracy. While strong intuition about the softmax nonlinearity applied to the query-key inner product suggests that it has desirable properties compared to other nonlinearities, the question of why this discrepancy exists remains unanswered. This work demonstrates that linear attention is an approximation of softmax attention by deriving the recurrent form of softmax attention. Using this form, each part of softmax attention can be described in the language of recurrent neural networks (RNNs). Describing softmax attention as an RNN allows the components of softmax attention to be ablated, revealing the importance of each part and how they interact. In this way, our work helps explain why softmax attention is more expressive than its counterparts.
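The contrast the abstract draws can be made concrete with a small numerical sketch (an illustration under common conventions, not the paper's derivation): causal softmax attention rescans all past keys at every step, while linear attention with a positive feature map (here the widely used elu(x)+1, an assumption) admits an exact recurrence over a running state and normalizer. The recurrent and parallel linear forms agree exactly, yet both differ from softmax attention, which is the approximation gap the paper analyzes.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 4                       # toy sequence length and head dimension
Q, K, V = rng.standard_normal((3, T, d))

def softmax_attn(Q, K, V):
    """Causal softmax attention: O(T^2), rescans all past keys each step."""
    out = np.zeros_like(V)
    for t in range(T):
        s = Q[t] @ K[:t + 1].T / np.sqrt(d)
        w = np.exp(s - s.max())
        out[t] = (w / w.sum()) @ V[:t + 1]
    return out

# Positive feature map replacing the softmax: elu(x) + 1 (an assumed choice).
phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))

def linear_attn_recurrent(Q, K, V):
    """Linear attention as an RNN: constant-size state S and normalizer z."""
    S = np.zeros((d, d))          # running sum of phi(k) v^T
    z = np.zeros(d)               # running sum of phi(k)
    out = np.zeros_like(V)
    for t in range(T):
        k, q = phi(K[t]), phi(Q[t])
        S += np.outer(k, V[t])
        z += k
        out[t] = (q @ S) / (q @ z)
    return out

def linear_attn_parallel(Q, K, V):
    """Same linear attention computed in the quadratic, attention-matrix form."""
    out = np.zeros_like(V)
    for t in range(T):
        w = phi(Q[t]) @ phi(K[:t + 1]).T
        out[t] = (w / w.sum()) @ V[:t + 1]
    return out
```

Running the three on the same inputs, the recurrent and parallel linear forms match to machine precision, while the softmax output differs, since the feature-map weights only approximate the exponential of the scaled inner product.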