Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers

June 24, 2024
Authors: Chao Lou, Zixia Jia, Zilong Zheng, Kewei Tu
cs.AI

Abstract

Accommodating long sequences efficiently in autoregressive Transformers, especially within an extended context window, poses significant challenges due to the quadratic computational complexity and substantial KV memory requirements inherent in self-attention mechanisms. In this work, we introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome these computational and memory obstacles while maintaining performance. Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query, thereby enabling gradient-based optimization. As a result, SPARSEK Attention offers linear time complexity and constant memory footprint during generation. Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods and provides significant speed improvements during both training and inference, particularly in language modeling and downstream tasks. Furthermore, our method can be seamlessly integrated into pre-trained Large Language Models (LLMs) with minimal fine-tuning, offering a practical solution for effectively managing long-range dependencies in diverse applications.
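To make the selection mechanism concrete, below is a minimal illustrative sketch (in PyTorch) of scoring-based top-k KV selection under a causal mask. It is not the authors' implementation: a hard, non-differentiable torch.topk stands in for the differentiable SPARSEK operator, and the full attention matrix is materialized for clarity, so the sketch does not reproduce the linear-time, constant-memory behavior described above. The function name topk_sparse_attention, the single-layer scoring network, and the k_budget parameter are all hypothetical.

```python
# Minimal sketch, NOT the paper's implementation: per-query selection of a
# fixed budget of KV pairs using a learned key score, with a hard torch.topk
# standing in for the differentiable SPARSEK mask operator.
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, key_scores, k_budget):
    """q, k, v: (T, d); key_scores: (T,) importance score per key position."""
    T, d = q.shape
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))      # query i sees keys j <= i
    # Broadcast each key's score to every query row and hide future keys.
    sel = key_scores.expand(T, T).masked_fill(~causal, float("-inf"))
    # Indices of the k_budget highest-scoring visible keys for each query.
    kept = torch.topk(sel, k=min(k_budget, T), dim=-1).indices   # (T, k_budget)
    keep = torch.zeros(T, T, dtype=torch.bool).scatter_(1, kept, True) & causal

    attn = (q @ k.transpose(-1, -2)) / d ** 0.5
    attn = attn.masked_fill(~keep, float("-inf"))                # drop unselected KV pairs
    return F.softmax(attn, dim=-1) @ v

# Toy usage: a single linear layer plays the role of the scoring network.
T, d, k_budget = 16, 32, 4
q, k, v = (torch.randn(T, d) for _ in range(3))
scorer = torch.nn.Linear(d, 1)                                   # hypothetical scoring network
out = topk_sparse_attention(q, k, v, scorer(k).squeeze(-1), k_budget)
print(out.shape)  # torch.Size([16, 32])
```

In this toy version the hard selection blocks gradients from flowing into the scoring network; the paper's contribution is precisely a differentiable top-k mask (SPARSEK) that makes the selection trainable end to end while keeping the per-query KV budget constant.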
