Cottention: Linear Transformers With Cosine Attention
September 27, 2024
Authors: Gabriel Mongaras, Trevor Dohm, Eric C. Larson
cs.AI
Abstract
Attention mechanisms, particularly softmax attention, have been instrumental
in the success of transformer-based models such as GPT. However, the quadratic
memory complexity of softmax attention with respect to sequence length poses
significant challenges for processing longer sequences. We introduce
Cottention, a novel attention mechanism that replaces the softmax operation
with cosine similarity. By leveraging the properties of cosine similarity and
rearranging the attention equation, Cottention achieves native linear memory
complexity with respect to sequence length, making it inherently more
memory-efficient than softmax attention. We demonstrate that Cottention can be
reformulated as a recurrent neural network (RNN) with a finite hidden state,
allowing for constant memory usage during inference. We evaluate Cottention on
both the bidirectional BERT and causal GPT tasks, demonstrating comparable
performance to softmax attention while significantly reducing memory
requirements. To ensure efficient computation, we develop a custom CUDA kernel
for Cottention. Our results show that Cottention is a promising alternative to
softmax attention, enabling the processing of longer sequences without
sacrificing performance, due to its native linear memory complexity and ability
to maintain a constant memory footprint during inference.
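The abstract does not spell out the exact formulation, but the core rearrangement it describes can be illustrated with a minimal sketch: because cosine similarity is just a dot product of L2-normalized queries and keys (with no row-wise softmax coupling the scores), the matrix products can be reassociated so that memory grows linearly with sequence length, and the causal case can be unrolled as a recurrence with a fixed-size state. The function names, epsilon, and normalization details below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of cosine attention and its linear-memory / recurrent forms.
import numpy as np

def l2_normalize(x, eps=1e-6):
    """Normalize rows to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def cosine_attention_quadratic(Q, K, V):
    """Naive form: materializes the (n x n) similarity matrix (quadratic memory)."""
    Qn, Kn = l2_normalize(Q), l2_normalize(K)
    S = Qn @ Kn.T              # (n, n) cosine similarities
    return S @ V               # (n, d_v)

def cosine_attention_linear(Q, K, V):
    """Rearranged form: computes K^T V first, so the largest intermediate is
    (d_k, d_v) and memory is linear in sequence length."""
    Qn, Kn = l2_normalize(Q), l2_normalize(K)
    KV = Kn.T @ V              # (d_k, d_v), independent of n
    return Qn @ KV             # (n, d_v)

def cosine_attention_recurrent(Q, K, V):
    """Causal/recurrent view with a finite hidden state: the state accumulates
    rank-1 updates k_t v_t^T, giving constant memory at inference time."""
    Qn, Kn = l2_normalize(Q), l2_normalize(K)
    n, d_v = V.shape
    state = np.zeros((Qn.shape[1], d_v))
    out = np.empty((n, d_v))
    for t in range(n):
        state += np.outer(Kn[t], V[t])   # rank-1 state update
        out[t] = Qn[t] @ state           # attend only to positions <= t
    return out

# Quick check: the rearranged (non-causal) form matches the naive form.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(8, 4)), rng.normal(size=(8, 4)), rng.normal(size=(8, 3))
assert np.allclose(cosine_attention_quadratic(Q, K, V), cosine_attention_linear(Q, K, V))
```

The linear and quadratic forms agree exactly by associativity of matrix multiplication; the recurrent form corresponds to the causal (masked) variant, which is what allows constant memory usage during autoregressive inference. The paper's custom CUDA kernel and any additional scaling or output normalization are beyond what this sketch shows.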