SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference
February 25, 2025
Authors: Jintao Zhang, Chendong Xiang, Haofeng Huang, Jia Wei, Haocheng Xi, Jun Zhu, Jianfei Chen
cs.AI
Abstract
An efficient attention implementation is essential for large models due to
its quadratic time complexity. Fortunately, attention commonly exhibits
sparsity, i.e., many values in the attention map are near zero, allowing for
the omission of corresponding computations. Many studies have utilized the
sparse pattern to accelerate attention. However, most existing works focus on
optimizing attention within specific models by exploiting certain sparse
patterns of the attention map. A universal sparse attention that guarantees
both the speedup and end-to-end performance of diverse models remains elusive.
In this paper, we propose SpargeAttn, a universal sparse and quantized
attention for any model. Our method uses a two-stage online filter: in the
first stage, we rapidly and accurately predict the attention map, enabling the
skip of some matrix multiplications in attention. In the second stage, we
design an online softmax-aware filter that incurs no extra overhead and further
skips some matrix multiplications. Experiments show that our method
significantly accelerates diverse models, including language, image, and video
generation, without sacrificing end-to-end metrics. The codes are available at
https://github.com/thu-ml/SpargeAttn.
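To make the two-stage idea concrete, below is a minimal NumPy sketch of block-sparse attention with two online filters. It is an illustration of the general technique the abstract describes, not the SpargeAttn implementation: the block-mean dot-product predictor, the function name sparse_block_attention, and the keep_frac and skip_thresh parameters are hypothetical stand-ins, and the quantization component of the actual method is omitted entirely.

```python
import numpy as np

def sparse_block_attention(Q, K, V, block=64, keep_frac=0.5, skip_thresh=1e-4):
    """Illustrative block-sparse attention with two online filters (not the paper's code).

    Stage 1: a cheap block-level prediction (here, dot products of mean-pooled
    query/key blocks, a stand-in for the paper's predictor) selects which
    key blocks to compute at all, skipping the other Q @ K^T multiplications.
    Stage 2: during the online-softmax pass, a key block whose scores are all
    far below the running maximum contributes ~0 after softmax, so its
    P @ V multiplication is skipped as well.
    """
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    nb = (n + block - 1) // block

    # Mean-pooled block representatives for the stage-1 prediction.
    q_pool = np.stack([Q[i * block:(i + 1) * block].mean(0) for i in range(nb)])
    k_pool = np.stack([K[j * block:(j + 1) * block].mean(0) for j in range(nb)])
    sim = q_pool @ k_pool.T  # coarse estimate of block-level attention scores

    O = np.zeros_like(Q)
    for i in range(nb):
        qi = Q[i * block:(i + 1) * block]
        # Stage 1: keep only the key blocks with the largest predicted scores,
        # processed in descending order so the running max settles early.
        order = np.argsort(-sim[i])
        keep = order[: max(1, int(np.ceil(keep_frac * nb)))]

        m = np.full(qi.shape[0], -np.inf)   # running max (online softmax)
        l = np.zeros(qi.shape[0])           # running softmax denominator
        acc = np.zeros_like(qi)             # running weighted-value accumulator
        for j in keep:
            kj = K[j * block:(j + 1) * block]
            vj = V[j * block:(j + 1) * block]
            s = (qi @ kj.T) * scale
            m_new = np.maximum(m, s.max(axis=1))
            # Stage 2: if every row's best score in this block is negligible
            # relative to the running max, skip the P @ V multiplication.
            if np.all(np.exp(s.max(axis=1) - m_new) < skip_thresh):
                continue
            p = np.exp(s - m_new[:, None])
            alpha = np.exp(m - m_new)       # rescale previous partial results
            l = l * alpha + p.sum(axis=1)
            acc = acc * alpha[:, None] + p @ vj
            m = m_new
        O[i * block:(i + 1) * block] = acc / np.maximum(l, 1e-20)[:, None]
    return O
```

Because online softmax is order-independent, visiting the predicted high-score key blocks first is a free design choice that makes the stage-2 filter fire more often; setting keep_frac=1.0 and skip_thresh=0 recovers dense attention, which is a convenient sanity check for a sketch like this.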