Cottention: Linear Transformers With Cosine Attention
September 27, 2024
Authors: Gabriel Mongaras, Trevor Dohm, Eric C. Larson
cs.AI
Abstract
Attention mechanisms, particularly softmax attention, have been instrumental
in the success of transformer-based models such as GPT. However, the quadratic
memory complexity of softmax attention with respect to sequence length poses
significant challenges for processing longer sequences. We introduce
Cottention, a novel attention mechanism that replaces the softmax operation
with cosine similarity. By leveraging the properties of cosine similarity and
rearranging the attention equation, Cottention achieves native linear memory
complexity with respect to sequence length, making it inherently more
memory-efficient than softmax attention. We demonstrate that Cottention can be
reformulated as a recurrent neural network (RNN) with a finite hidden state,
allowing for constant memory usage during inference. We evaluate Cottention on
both the bidirectional BERT and causal GPT tasks, demonstrating comparable
performance to softmax attention while significantly reducing memory
requirements. To ensure efficient computation, we develop a custom CUDA kernel
for Cottention. Our results show that Cottention is a promising alternative to
softmax attention, enabling the processing of longer sequences without
sacrificing performance, due to its native linear memory complexity and ability
to maintain a constant memory footprint during inference.
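The abstract does not spell out the exact formulation, but the core rearrangement it describes can be illustrated with a minimal sketch: because cosine similarity is just a dot product of L2-normalized queries and keys (with no row-wise softmax coupling the scores), the matrix products can be reassociated so that memory grows linearly with sequence length, and the causal case can be unrolled as a recurrence with a fixed-size state. The function names, epsilon, and normalization details below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of cosine attention and its linear-memory / recurrent forms.
import numpy as np

def l2_normalize(x, eps=1e-6):
    """Normalize rows to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def cosine_attention_quadratic(Q, K, V):
    """Naive form: materializes the (n x n) similarity matrix (quadratic memory)."""
    Qn, Kn = l2_normalize(Q), l2_normalize(K)
    S = Qn @ Kn.T              # (n, n) cosine similarities
    return S @ V               # (n, d_v)

def cosine_attention_linear(Q, K, V):
    """Rearranged form: computes K^T V first, so the largest intermediate is
    (d_k, d_v) and memory is linear in sequence length."""
    Qn, Kn = l2_normalize(Q), l2_normalize(K)
    KV = Kn.T @ V              # (d_k, d_v), independent of n
    return Qn @ KV             # (n, d_v)

def cosine_attention_recurrent(Q, K, V):
    """Causal/recurrent view with a finite hidden state: the state accumulates
    rank-1 updates k_t v_t^T, giving constant memory at inference time."""
    Qn, Kn = l2_normalize(Q), l2_normalize(K)
    n, d_v = V.shape
    state = np.zeros((Qn.shape[1], d_v))
    out = np.empty((n, d_v))
    for t in range(n):
        state += np.outer(Kn[t], V[t])   # rank-1 state update
        out[t] = Qn[t] @ state           # attend only to positions <= t
    return out

# Quick check: the rearranged (non-causal) form matches the naive form.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(8, 4)), rng.normal(size=(8, 4)), rng.normal(size=(8, 3))
assert np.allclose(cosine_attention_quadratic(Q, K, V), cosine_attention_linear(Q, K, V))
```

The linear and quadratic forms agree exactly by associativity of matrix multiplication; the recurrent form corresponds to the causal (masked) variant, which is what allows constant memory usage during autoregressive inference. The paper's custom CUDA kernel and any additional scaling or output normalization are beyond what this sketch shows.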