Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models

Abstract

Linear attention is an efficient attention mechanism that has recently emerged as a promising alternative to conventional softmax attention. Because it processes tokens with linear computational complexity, linear attention can in theory handle sequences of unlimited length without sacrificing speed, i.e., it maintains a constant training speed across sequence lengths with fixed memory consumption. However, due to the issue with cumulative summation (cumsum), current linear attention algorithms cannot demonstrate this theoretical advantage in a causal setting. In this paper, we present Lightning Attention-2, the first linear attention implementation that realizes these theoretical computational benefits. To achieve this, we leverage the idea of tiling and handle the intra-block and inter-block components of the linear attention calculation separately. Specifically, we use the conventional attention computation mechanism for the intra-block components and apply the linear attention kernel trick to the inter-block components. Tiling is adopted in both the forward and backward passes to take full advantage of GPU hardware. We implement our algorithm in Triton to make it IO-aware and hardware-friendly. We conduct experiments on different model sizes and sequence lengths. Lightning Attention-2 retains consistent training and inference speed regardless of input sequence length and is significantly faster than other attention mechanisms. The source code is available at https://github.com/OpenNLPLab/lightning-attention.
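
The intra-block and inter-block split described in the abstract can be illustrated with a small sketch. This is a minimal pure-Python illustration under stated assumptions, not the paper's Triton kernel: it assumes unnormalized causal linear attention of the form O = (Q K^T masked to the lower triangle) V, and it leaves out any decay term, feature map, normalization, and the backward pass. The function names and the `block_size` parameter are hypothetical and chosen only for this example.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def causal_linear_attention_reference(q, k, v):
    """Quadratic-cost reference: mask Q K^T to the lower triangle, then multiply by V."""
    scores = matmul(q, transpose(k))
    masked = [[s if j <= i else 0.0 for j, s in enumerate(row)]
              for i, row in enumerate(scores)]
    return matmul(masked, v)


def causal_linear_attention_tiled(q, k, v, block_size):
    """Block-wise version: masked attention inside each block, running K^T V across blocks."""
    n, d, e = len(q), len(q[0]), len(v[0])
    kv = [[0.0] * e for _ in range(d)]  # sum of K_j^T V_j over all earlier blocks (d x e)
    out = []
    for start in range(0, n, block_size):
        qb = q[start:start + block_size]
        kb = k[start:start + block_size]
        vb = v[start:start + block_size]
        # Intra-block: conventional (masked) attention restricted to this block.
        intra = causal_linear_attention_reference(qb, kb, vb)
        # Inter-block: kernel trick, Q_i times the accumulated K^T V of earlier blocks.
        inter = matmul(qb, kv)
        out.extend(add(intra, inter))
        # Fold this block into the running state; its size does not depend on n.
        kv = add(kv, matmul(transpose(kb), vb))
    return out


# Toy inputs (hypothetical values): 5 tokens, key/query dim 2, value dim 3.
q = [[1.0, 2.0], [0.5, 1.0], [2.0, 0.0], [1.0, 1.0], [0.0, 3.0]]
k = [[1.0, 0.0], [0.5, 0.5], [1.0, 2.0], [0.0, 1.0], [2.0, 1.0]]
v = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.5], [2.0, 0.0, 1.0], [1.0, 1.0, 1.0], [0.5, 0.5, 2.0]]

reference = causal_linear_attention_reference(q, k, v)
tiled = causal_linear_attention_tiled(q, k, v, block_size=2)
assert all(abs(a - b) < 1e-9
           for row_t, row_r in zip(tiled, reference)
           for a, b in zip(row_t, row_r))
```

In this sketch the only state carried between blocks is the d x e matrix `kv`, so the per-block cost is independent of how many tokens came before. That is the property a block-wise formulation relies on to avoid a token-by-token cumsum in the causal setting.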
