Sparser Block-Sparse Attention via Token Permutation

October 24, 2025
Authors: Xinghao Wang, Pengyu Wang, Dong Zhang, Chenkun Tan, Shaojun Zhou, Zhaoxiang Liu, Shiguo Lian, Fangxu Liu, Kai Song, Xipeng Qiu
cs.AI

Abstract

Scaling the context length of large language models (LLMs) offers significant benefits but is computationally expensive. This expense stems primarily from the self-attention mechanism, whose O(N²) complexity with respect to sequence length presents a major bottleneck for both memory and latency. Fortunately, the attention matrix is often sparse, particularly for long sequences, suggesting an opportunity for optimization. Block-sparse attention has emerged as a promising solution that partitions the sequence into blocks and skips computation for a subset of them. However, the effectiveness of this method is highly dependent on the underlying attention patterns, which can lead to sub-optimal block-level sparsity. For instance, the important key tokens for queries within a single block may be scattered across numerous other blocks, leading to computational redundancy. In this work, we propose Permuted Block-Sparse Attention (PBS-Attn), a plug-and-play method that leverages the permutation properties of attention to increase block-level sparsity and enhance the computational efficiency of LLM prefilling. We conduct comprehensive experiments on challenging real-world long-context datasets, demonstrating that PBS-Attn consistently outperforms existing block-sparse attention methods in model accuracy and closely matches the full-attention baseline. Powered by our custom permuted-FlashAttention kernels, PBS-Attn achieves an end-to-end speedup of up to 2.75× in long-context prefilling, confirming its practical viability. Code is available at https://github.com/xinghaow99/pbs-attn
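
To make the intuition concrete, here is a minimal NumPy sketch of why permuting key tokens can increase block-level sparsity. It is not the authors' algorithm or kernels: the toy importance pattern, the block size, the "assign each key to the query block that uses it most" heuristic, and the omission of causal masking are all illustrative assumptions. It simply counts how many (query-block, key-block) pairs contain at least one needed entry before and after reordering the keys.

```python
# Toy illustration of block-level sparsity gains from key permutation.
# NOT the PBS-Attn implementation; a sketch under simplifying assumptions
# (non-causal mask, synthetic importance pattern, greedy "owner" heuristic).
import numpy as np

def active_blocks(mask: np.ndarray, block: int) -> int:
    """Count (query-block, key-block) pairs containing at least one needed entry."""
    n = mask.shape[0]
    nb = n // block
    return int(mask.reshape(nb, block, nb, block).any(axis=(1, 3)).sum())

rng = np.random.default_rng(0)
n, block = 64, 16
nb = n // block

# Synthetic importance pattern: each query block needs 8 key tokens drawn from
# a disjoint random pool, so its important keys are scattered across key blocks.
mask = np.zeros((n, n), dtype=bool)
pools = rng.permutation(n).reshape(nb, -1)   # disjoint key pools, one per query block
for qb in range(nb):
    keys = pools[qb, :8]                     # 8 scattered important keys
    mask[qb * block:(qb + 1) * block, keys] = True

before = active_blocks(mask, block)

# Permutation heuristic: place each key next to the query block that uses it
# most; unused keys go to the end. A stable sort keeps each group contiguous.
usage = mask.reshape(nb, block, n).sum(axis=1)                  # (nb, n) use counts
owner = np.where(usage.sum(axis=0) > 0, usage.argmax(axis=0), nb)
perm = np.argsort(owner, kind="stable")

after = active_blocks(mask[:, perm], block)
print(f"active blocks before permutation: {before} / {nb * nb}")
print(f"active blocks after permutation:  {after} / {nb * nb}")
```

On this toy pattern, each query block's scattered keys collapse into a single key block after the permutation, so far fewer blocks need to be computed; the attention output is unchanged because a permutation of keys (with the matching permutation of values) only reorders the terms of the softmax-weighted sum. This block-count reduction is the source of the speedup that the custom permuted-FlashAttention kernels exploit.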