Sparse Attention Mechanisms: Applying Sparse Attention in Diffusion Language Models
SparseD: Sparse Attention for Diffusion Language Models
September 28, 2025
Authors: Zeqing Wang, Gongfan Fang, Xinyin Ma, Xingyi Yang, Xinchao Wang
cs.AI
Abstract
While diffusion language models (DLMs) offer a promising alternative to
autoregressive models (ARs), existing open-source DLMs suffer from high
inference latency. This bottleneck is mainly due to the attention's quadratic
complexity with respect to context length in computing all query-key pairs.
Intuitively, to reduce this complexity, a natural strategy is to restrict
attention to sparse patterns that retain only the most relevant connections.
Such approaches are well-established in ARs, where attention follows fixed and
clearly defined sparse patterns. However, in DLMs, we observe distinct sparsity
behaviors: (1) attention patterns vary across heads, (2) attention patterns in
each head remain highly similar across denoising steps, and (3) early denoising
steps are critical for generation. These findings render sparse attention
methods designed for ARs largely incompatible with DLMs, as they fail to
capture head-specific structures and risk degrading generation when applied in
early denoising steps. To address these challenges, we propose SparseD, a novel
sparse attention method for DLMs. Leveraging these observations, SparseD
pre-computes head-specific sparse patterns only once and reuses them across
all steps, avoiding the need to recompute sparse patterns at every denoising
step. Meanwhile, SparseD uses full attention in the early steps, then switches
to sparse attention later to maintain generation quality. Together, these
properties establish SparseD as a practical and efficient solution for deploying DLMs in
long-context applications. Experimental results demonstrate that SparseD
achieves lossless acceleration, delivering up to a 1.50× speedup over
FlashAttention at a 64k context length with 1,024 denoising steps.
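
The schedule described in the abstract, full attention during the early denoising steps followed by reuse of per-head sparse patterns that are pre-computed once, can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: it abstracts a single attention call per step, and the helper names (full_attention, sparse_attention, head_specific_masks), the model hooks (project_qkv, denoise_step), the switch_step and keep_ratio parameters, and the top-k pattern selection are all hypothetical choices for exposition.

import torch

def full_attention(q, k, v):
    # q, k, v: [batch, heads, seq_len, head_dim]
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    probs = torch.softmax(scores, dim=-1)
    return probs @ v, scores

def head_specific_masks(scores, keep_ratio):
    # Keep the top keep_ratio fraction of key positions per query and per head,
    # yielding one boolean pattern per attention head. Top-k selection is an
    # assumption; the abstract only states that patterns are head-specific.
    seq_len = scores.shape[-1]
    k_keep = max(1, int(keep_ratio * seq_len))
    topk_idx = scores.topk(k_keep, dim=-1).indices
    return torch.zeros_like(scores, dtype=torch.bool).scatter_(-1, topk_idx, True)

def sparse_attention(q, k, v, mask):
    # Dense emulation of sparse attention: masked-out query-key pairs get -inf.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

def denoise(model, x, num_steps=1024, switch_step=64, keep_ratio=0.1):
    # Full attention for the first switch_step steps (early steps are critical),
    # then sparse attention with patterns pre-computed once and reused thereafter.
    masks = None
    for t in range(num_steps):
        q, k, v = model.project_qkv(x, t)      # hypothetical DLM hook
        if t < switch_step:
            out, scores = full_attention(q, k, v)
            if t == switch_step - 1:
                masks = head_specific_masks(scores, keep_ratio)
        else:
            out = sparse_attention(q, k, v, masks)
        x = model.denoise_step(x, out, t)      # hypothetical update rule
    return x

Note that the dense boolean mask above only illustrates the per-head pattern; realizing the reported speedup over FlashAttention would require sparse attention kernels that skip the masked query-key pairs entirely rather than materializing the full score matrix.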