SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference

Abstract

An efficient attention implementation is essential for large models because attention has quadratic time complexity. Fortunately, attention commonly exhibits sparsity: many values in the attention map are near zero, so the corresponding computations can be omitted. Many studies have exploited this sparsity to accelerate attention. However, most existing work optimizes attention within a specific model by exploiting particular sparse patterns of its attention map. A universal sparse attention that guarantees both speedup and end-to-end performance across diverse models remains elusive. In this paper, we propose SpargeAttn, a universal sparse and quantized attention for any model. Our method uses a two-stage online filter. In the first stage, we rapidly and accurately predict the attention map, so that some matrix multiplications in attention can be skipped. In the second stage, we design an online softmax-aware filter that incurs no extra overhead and skips further matrix multiplications. Experiments show that our method significantly accelerates diverse models, including language, image, and video generation models, without sacrificing end-to-end metrics. The code is available at https://github.com/thu-ml/SpargeAttn.
