

Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models

January 9, 2024
作者: Zhen Qin, Weigao Sun, Dong Li, Xuyang Shen, Weixuan Sun, Yiran Zhong
cs.AI

Abstract

Linear attention is an efficient attention mechanism that has recently emerged as a promising alternative to conventional softmax attention. With its linear computational complexity in the number of tokens, linear attention can, in theory, handle sequences of unlimited length without sacrificing speed, i.e., it maintains a constant training speed across sequence lengths with fixed memory consumption. However, due to issues with cumulative summation (cumsum), current linear attention algorithms cannot demonstrate this theoretical advantage in the causal setting. In this paper, we present Lightning Attention-2, the first implementation that enables linear attention to realize its theoretical computational benefits. To achieve this, we leverage the idea of tiling, handling the intra-block and inter-block components of the linear attention computation separately. Specifically, we use the conventional attention computation for the intra-block components and apply the linear attention kernel trick for the inter-block components. The tiling technique is applied in both the forward and backward passes to take full advantage of GPU hardware. We implement the algorithm in Triton to make it IO-aware and hardware-friendly. Experiments are conducted across different model sizes and sequence lengths. Lightning Attention-2 retains consistent training and inference speed regardless of input sequence length and is significantly faster than other attention mechanisms. The source code is available at https://github.com/OpenNLPLab/lightning-attention.
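To make the intra-block/inter-block split concrete, the following is a minimal NumPy sketch of blockwise causal linear attention: within each block the scores are computed and masked as in conventional attention, while contributions from earlier blocks come from a running K^T V state (the kernel trick). This is an illustrative reference only, not the authors' Triton kernel; it assumes unnormalized linear attention without any decay term, and the function name and block_size are hypothetical.

```python
import numpy as np

def blockwise_causal_linear_attention(Q, K, V, block_size=64):
    """Illustrative sketch of the intra-/inter-block decomposition.
    Q, K: (n, d); V: (n, e). Not the official implementation."""
    n, d = Q.shape
    e = V.shape[1]
    O = np.zeros((n, e))
    kv = np.zeros((d, e))  # running sum of k_j^T v_j over all previous blocks

    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        q, k, v = Q[start:end], K[start:end], V[start:end]

        # Intra-block: conventional (quadratic) attention with a causal mask,
        # restricted to tokens inside the current block.
        scores = q @ k.T
        mask = np.tril(np.ones_like(scores))
        O[start:end] = (scores * mask) @ v

        # Inter-block: linear-attention kernel trick against the accumulated KV state.
        O[start:end] += q @ kv

        # Update the KV state with this block's contribution.
        kv += k.T @ v

    return O
```

In Lightning Attention-2 this decomposition is realized with tiled Triton kernels in both the forward and backward passes so that the computation stays IO-aware and hardware-friendly; the loop above only mirrors the mathematical structure.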
