Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers
June 24, 2024
Authors: Chao Lou, Zixia Jia, Zilong Zheng, Kewei Tu
cs.AI
Abstract
Accommodating long sequences efficiently in autoregressive Transformers,
especially within an extended context window, poses significant challenges due
to the quadratic computational complexity and substantial KV memory
requirements inherent in self-attention mechanisms. In this work, we introduce
SPARSEK Attention, a novel sparse attention mechanism designed to overcome
these computational and memory obstacles while maintaining performance. Our
approach integrates a scoring network and a differentiable top-k mask operator,
SPARSEK, to select a constant number of KV pairs for each query, thereby
enabling gradient-based optimization. As a result, SPARSEK Attention offers
linear time complexity and constant memory footprint during generation.
Experimental results reveal that SPARSEK Attention outperforms previous sparse
attention methods and provides significant speed improvements during both
training and inference, particularly in language modeling and downstream tasks.
Furthermore, our method can be seamlessly integrated into pre-trained Large
Language Models (LLMs) with minimal fine-tuning, offering a practical solution
for effectively managing long-range dependencies in diverse applications.
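
To make the selection idea concrete, the sketch below restricts attention to a constant number of scored KV pairs per query. It is a simplified illustration under stated assumptions, not the paper's implementation: it uses a hard, non-differentiable top-k in place of the differentiable SPARSEK mask operator, applies one shared selection rather than per-query selection, and omits causal masking; the names topk_sparse_attention, score_net, and topk are illustrative.

```python
# Minimal sketch of sparse attention over a constant number of selected KV pairs.
# Assumptions: a hard top-k stands in for the paper's differentiable SPARSEK mask,
# one shared selection replaces per-query selection, and causal masking is omitted.
import torch

def topk_sparse_attention(q, k, v, kv_scores, topk):
    """q: (B, Tq, D); k, v: (B, Tk, D); kv_scores: (B, Tk) importance per KV pair."""
    B, Tq, D = q.shape
    # Keep only the topk highest-scoring KV positions (hard selection for illustration).
    idx = kv_scores.topk(topk, dim=-1).indices                # (B, topk)
    gather_idx = idx.unsqueeze(-1).expand(-1, -1, D)          # (B, topk, D)
    k_sel = torch.gather(k, 1, gather_idx)
    v_sel = torch.gather(v, 1, gather_idx)
    # Standard scaled dot-product attention, but only over the selected pairs,
    # so per-query cost is O(topk) rather than O(Tk).
    logits = q @ k_sel.transpose(-2, -1) / D ** 0.5           # (B, Tq, topk)
    return torch.softmax(logits, dim=-1) @ v_sel              # (B, Tq, D)

# Toy usage: a linear layer over the keys plays the role of the scoring network.
B, Tq, Tk, D, topk = 2, 8, 128, 64, 16
q, k, v = torch.randn(B, Tq, D), torch.randn(B, Tk, D), torch.randn(B, Tk, D)
score_net = torch.nn.Linear(D, 1)
kv_scores = score_net(k).squeeze(-1)                          # (B, Tk)
out = topk_sparse_attention(q, k, v, kv_scores, topk)
print(out.shape)  # torch.Size([2, 8, 64])
```

Because each query attends to at most topk pairs, the attention cost grows linearly in sequence length and the retained KV state stays constant in size, which is the property the paper exploits during generation; the differentiable mask described in the abstract is what makes the selection trainable end to end.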