

Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models

January 9, 2024
Authors: Zhen Qin, Weigao Sun, Dong Li, Xuyang Shen, Weixuan Sun, Yiran Zhong
cs.AI

Abstract

Linear attention is an efficient attention mechanism that has recently emerged as a promising alternative to conventional softmax attention. With its ability to process tokens with linear computational complexity, linear attention can, in theory, handle sequences of unlimited length without sacrificing speed, i.e., maintaining a constant training speed for various sequence lengths with a fixed memory consumption. However, due to the issue with cumulative summation (cumsum), current linear attention algorithms cannot demonstrate this theoretical advantage in a causal setting. In this paper, we present Lightning Attention-2, the first implementation that enables linear attention to realize its theoretical computational benefits. To achieve this, we leverage the idea of tiling, handling the intra-block and inter-block components of the linear attention calculation separately. Specifically, we use the conventional attention computation mechanism for the intra-block part and apply the linear attention kernel trick for the inter-block part. The tiling technique is applied throughout both the forward and backward passes to take full advantage of the GPU hardware. We implement our algorithm in Triton to make it IO-aware and hardware-friendly. Various experiments are conducted on different model sizes and sequence lengths. Lightning Attention-2 retains consistent training and inference speed regardless of input sequence length and is significantly faster than other attention mechanisms. The source code is available at https://github.com/OpenNLPLab/lightning-attention.
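To make the intra-block/inter-block split concrete, below is a minimal sketch of a tiled causal linear attention forward pass in plain PyTorch. It is not the authors' implementation (theirs is a Triton kernel at the repository above); it assumes a simplified, unnormalized linear attention with no decay term, and the function name, tensor shapes, and block_size parameter are hypothetical choices for illustration.

```python
import torch


def tiled_causal_linear_attention(q, k, v, block_size=256):
    """Sketch of a tiled causal linear attention forward pass.

    q, k, v: tensors of shape (batch, heads, seq_len, dim).
    Intra-block: conventional masked QK^T attention within each block.
    Inter-block: a running K^T V state carries the contribution of all
    previous blocks, so per-block cost does not grow with sequence length.
    """
    b, h, n, d = q.shape
    e = v.shape[-1]
    out = torch.zeros(b, h, n, e, dtype=q.dtype, device=q.device)
    kv = torch.zeros(b, h, d, e, dtype=q.dtype, device=q.device)  # running K^T V state

    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        qi = q[:, :, start:end]  # (b, h, m, d)
        ki = k[:, :, start:end]
        vi = v[:, :, start:end]
        m = end - start

        # Inter-block part: all earlier blocks, via the accumulated KV state.
        o_inter = qi @ kv  # (b, h, m, e)

        # Intra-block part: conventional causal attention restricted to this block.
        scores = qi @ ki.transpose(-1, -2)  # (b, h, m, m)
        causal_mask = torch.tril(torch.ones(m, m, dtype=torch.bool, device=q.device))
        scores = scores.masked_fill(~causal_mask, 0.0)
        o_intra = scores @ vi  # (b, h, m, e)

        out[:, :, start:end] = o_inter + o_intra

        # Update the running state with this block's keys and values.
        kv = kv + ki.transpose(-1, -2) @ vi  # (b, h, d, e)

    return out
```

The key property this sketch illustrates is that the inter-block state `kv` has a fixed (dim x dim) size independent of sequence length, so the per-block work stays constant; the paper's contribution is realizing this recurrence efficiently on GPU hardware via Triton, including the backward pass.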

