Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers

Abstract

Accommodating long sequences efficiently in autoregressive Transformers, especially within an extended context window, poses significant challenges due to the quadratic computational complexity and substantial KV memory requirements inherent in self-attention mechanisms. In this work, we introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome these computational and memory obstacles while maintaining performance. Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query, thereby enabling gradient-based optimization. As a result, SPARSEK Attention offers linear time complexity and constant memory footprint during generation. Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods and provides significant speed improvements during both training and inference, particularly in language modeling and downstream tasks. Furthermore, our method can be seamlessly integrated into pre-trained Large Language Models (LLMs) with minimal fine-tuning, offering a practical solution for effectively managing long-range dependencies in diverse applications.
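
To make the selection idea concrete, the following is a minimal sketch of causal attention in which each query attends to at most k key-value pairs chosen by a per-position score. It is an illustration under stated assumptions, not the SPARSEK operator itself: the selection here is a hard, non-differentiable top-k, the `scores` list is a hypothetical stand-in for the output of the scoring network, and the function names are invented for this example.

```python
import math


def select_topk(scores, t, k):
    """Return the indices of the k highest-scoring positions among 0..t.

    `scores` is a hypothetical stand-in for a scoring network's output.
    This is a hard top-k, so no gradient flows through the selection.
    """
    return sorted(range(t + 1), key=lambda j: scores[j], reverse=True)[:k]


def topk_sparse_attention(queries, keys, values, scores, k):
    """Causal attention where each query attends to at most k KV pairs."""
    dim = len(values[0])
    outputs = []
    for t, q in enumerate(queries):
        selected = select_topk(scores, t, k)
        # Scaled dot-product logits over the selected keys only.
        scale = 1.0 / math.sqrt(len(q))
        logits = [scale * sum(a * b for a, b in zip(q, keys[j])) for j in selected]
        # Numerically stable softmax over at most k logits.
        peak = max(logits)
        weights = [math.exp(x - peak) for x in logits]
        total = sum(weights)
        # Weighted sum of the selected values.
        outputs.append([
            sum(w * values[j][d] for w, j in zip(weights, selected)) / total
            for d in range(dim)
        ])
    return outputs


if __name__ == "__main__":
    queries = [[1.0, 0.0]] * 4
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
    values = [[1.0], [2.0], [3.0], [4.0]]
    scores = [0.1, 0.9, 0.5, 0.3]
    print(topk_sparse_attention(queries, keys, values, scores, k=2))
```

Because each query touches at most k key-value pairs, the attention arithmetic per generated token does not grow with sequence length. This sketch still re-sorts all earlier scores at every step for simplicity; an incremental version would instead keep only the k best entries in a fixed-size cache, which is what a constant memory footprint during generation requires.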
