SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs
October 17, 2024
Authors: Yizhao Gao, Zhichen Zeng, Dayou Du, Shijie Cao, Hayden Kwok-Hay So, Ting Cao, Fan Yang, Mao Yang
cs.AI
Abstract
Attention is the cornerstone of modern Large Language Models (LLMs). Yet its
quadratic complexity limits the efficiency and scalability of LLMs, especially
for those with a long-context window. A promising approach to addressing this
limitation is to leverage the sparsity in attention. However, existing
sparsity-based solutions predominantly rely on predefined patterns or
heuristics to approximate sparsity. This practice falls short of fully capturing
the dynamic nature of attention sparsity in language-based tasks. This paper
argues that attention sparsity should be learned rather than predefined. To
this end, we design SeerAttention, a new attention mechanism that augments
conventional attention with a learnable gate that adaptively selects
significant blocks in an attention map and deems the remaining blocks sparse. Such
block-level sparsity effectively balances accuracy and speedup. To enable
efficient learning of the gating network, we develop a customized
FlashAttention implementation that extracts the block-level ground truth of
the attention map with minimal overhead. SeerAttention not only applies to
post-training, but also excels in long-context fine-tuning. Our results show
that at post-training stages, SeerAttention significantly outperforms
state-of-the-art static or heuristic-based sparse attention methods, while also
being more versatile and flexible in adapting to varying context lengths and
sparsity ratios. When applied to long-context fine-tuning with YaRN,
SeerAttention can achieve a remarkable 90% sparsity ratio at a 32k context
length with minimal perplexity loss, offering a 5.67x speedup over
FlashAttention-2.
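To make the abstract's mechanism concrete, below is a minimal PyTorch sketch (not the authors' implementation) of the two ideas it describes: a block-level gate that pools Q and K, scores query-block/key-block pairs, and keeps only the top-k key blocks per query block, and a block-level ground truth obtained by max-pooling a full attention map, which could serve as the gate's training target. The function names, the choice of mean/max pooling, and parameters such as `block_size` and `top_k` are illustrative assumptions; in the paper, the gate is learned and the ground truth is extracted inside a customized FlashAttention kernel rather than from a materialized attention map.

```python
# Minimal sketch of block-level sparse attention gating, assuming single-head
# 2-D tensors and a sequence length divisible by block_size. All names and
# defaults here are illustrative, not taken from the paper's code.
import torch
import torch.nn.functional as F


def block_gate_mask(q, k, block_size=64, top_k=8):
    """Select which key blocks each query block attends to.

    q, k: [seq_len, head_dim] tensors for one attention head.
    Returns a [num_q_blocks, num_k_blocks] boolean mask; False blocks are
    treated as sparse and skipped by a block-sparse attention kernel.
    """
    seq_len, head_dim = q.shape
    n_blocks = seq_len // block_size

    # Pool tokens into block representations (mean pooling as a stand-in
    # for the paper's learnable gate).
    q_blk = q.view(n_blocks, block_size, head_dim).mean(dim=1)
    k_blk = k.view(n_blocks, block_size, head_dim).mean(dim=1)

    # Block-level relevance scores, then keep the top-k key blocks per row.
    scores = (q_blk @ k_blk.T) / head_dim ** 0.5
    idx = scores.topk(min(top_k, n_blocks), dim=-1).indices
    mask = torch.zeros_like(scores)
    mask.scatter_(1, idx, 1.0)
    return mask.bool()


def block_ground_truth(attn_map, block_size=64, keep_ratio=0.1):
    """Block-level 'important block' labels from a dense attention map.

    attn_map: [seq_len, seq_len] softmax attention probabilities. The paper
    extracts this signal inside a custom FlashAttention kernel instead of
    materializing the full map as done here.
    """
    n_blocks = attn_map.shape[0] // block_size
    pooled = F.max_pool2d(attn_map[None, None], kernel_size=block_size)
    pooled = pooled.view(n_blocks, n_blocks)
    k = max(1, int(keep_ratio * n_blocks))
    idx = pooled.topk(k, dim=-1).indices
    labels = torch.zeros_like(pooled)
    labels.scatter_(1, idx, 1.0)
    return labels.bool()
```

At inference time, the boolean mask would drive a block-sparse attention kernel so that pruned blocks are never computed; the dense tensors here only illustrate the selection logic, and causal masking is omitted for brevity.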