

Sparser Block-Sparse Attention via Token Permutation

October 24, 2025
Authors: Xinghao Wang, Pengyu Wang, Dong Zhang, Chenkun Tan, Shaojun Zhou, Zhaoxiang Liu, Shiguo Lian, Fangxu Liu, Kai Song, Xipeng Qiu
cs.AI

Abstract

Scaling the context length of large language models (LLMs) offers significant benefits but is computationally expensive. This expense stems primarily from the self-attention mechanism, whose O(N²) complexity with respect to sequence length presents a major bottleneck for both memory and latency. Fortunately, the attention matrix is often sparse, particularly for long sequences, suggesting an opportunity for optimization. Block-sparse attention has emerged as a promising solution that partitions sequences into blocks and skips computation for a subset of these blocks. However, the effectiveness of this method is highly dependent on the underlying attention patterns, which can lead to sub-optimal block-level sparsity. For instance, important key tokens for queries within a single block may be scattered across numerous other blocks, leading to computational redundancy. In this work, we propose Permuted Block-Sparse Attention (PBS-Attn), a plug-and-play method that leverages the permutation properties of attention to increase block-level sparsity and enhance the computational efficiency of LLM prefilling. We conduct comprehensive experiments on challenging real-world long-context datasets, demonstrating that PBS-Attn consistently outperforms existing block-sparse attention methods in model accuracy and closely matches the full attention baseline. Powered by our custom permuted-FlashAttention kernels, PBS-Attn achieves an end-to-end speedup of up to 2.75× in long-context prefilling, confirming its practical viability. Code available at https://github.com/xinghaow99/pbs-attn
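
The abstract's key observation is that block-sparse attention wastes work when the important keys for a query block are spread across many key blocks. The toy NumPy sketch below is not the authors' implementation or permutation algorithm; it assumes a synthetic importance matrix and a simple sort-by-importance reordering of keys, purely to illustrate how permuting tokens can concentrate important keys into fewer blocks and so reduce how many blocks a block-sparse kernel must compute.

```python
# Minimal sketch (illustrative only, not PBS-Attn itself): shows how a token
# permutation can raise block-level sparsity in a toy importance matrix.
import numpy as np

np.random.seed(0)
seq_len, block = 64, 16              # toy sequence length and block size
n_blocks = seq_len // block

# Toy query-key importance scores: every query attends strongly to a few
# key positions that are scattered across the sequence.
scores = np.random.rand(seq_len, seq_len) * 0.1
important_keys = np.random.choice(seq_len, size=8, replace=False)
scores[:, important_keys] += 1.0     # scattered "important" key columns

def active_blocks(s, threshold=0.5):
    """Count (query-block, key-block) pairs that must be computed because
    they contain at least one score above the threshold."""
    mask = s > threshold
    blocks = mask.reshape(n_blocks, block, n_blocks, block).any(axis=(1, 3))
    return int(blocks.sum()), blocks.size

# Plain block-sparse attention: the 8 important keys land in many key blocks,
# so most blocks still have to be computed.
kept, total = active_blocks(scores)
print(f"without permutation: {kept}/{total} blocks computed")

# Hypothetical permutation: reorder keys by total importance so the important
# columns cluster into as few key blocks as possible.
perm = np.argsort(-scores.sum(axis=0))
kept_p, _ = active_blocks(scores[:, perm])
print(f"with permutation:    {kept_p}/{total} blocks computed")
```

Under this toy setup the unpermuted layout typically touches nearly every block, while the permuted layout packs the important keys into a single key block, which is the kind of block-level sparsity gain the abstract attributes to PBS-Attn (whose actual permutation strategy and permuted-FlashAttention kernels are described in the paper and repository).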