SparseD: Sparse Attention for Diffusion Language Models
September 28, 2025
Authors: Zeqing Wang, Gongfan Fang, Xinyin Ma, Xingyi Yang, Xinchao Wang
cs.AI
Abstract
While diffusion language models (DLMs) offer a promising alternative to
autoregressive models (ARs), existing open-source DLMs suffer from high
inference latency. This bottleneck is mainly due to the attention's quadratic
complexity with respect to context length in computing all query-key pairs.
Intuitively, to reduce this complexity, a natural strategy is to restrict
attention to sparse patterns that retain only the most relevant connections.
Such approaches are well-established in ARs, where attention follows fixed and
clearly defined sparse patterns. However, in DLMs, we observe distinct sparsity
behaviors: (1) attention patterns vary across heads, (2) attention patterns in
each head remain highly similar across denoising steps, and (3) early denoising
steps are critical for generation. These findings render sparse attention
methods designed for ARs largely incompatible with DLMs, as they fail to
capture head-specific structures and risk degrading generation when applied in
early denoising steps. To address these challenges, we propose SparseD, a novel
sparse attention method for DLMs. Leveraging these observations, SparseD
pre-computes head-specific sparse patterns only once and reuses them across
all steps, avoiding recomputation of the sparse patterns at every denoising
step. Meanwhile, SparseD uses full attention in the early steps, then switches
to sparse attention later to maintain generation quality. Together, these
properties establish SparseD as a practical and efficient solution for
deploying DLMs in long-context applications. Experimental results demonstrate
that SparseD achieves lossless acceleration, delivering up to a 1.50× speedup
over FlashAttention at a 64k context length with 1,024 denoising steps.
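
To make the mechanism described in the abstract concrete, here is a minimal
PyTorch sketch of a SparseD-style denoising loop, not the authors'
implementation: full attention is used for the early denoising steps,
head-specific sparse patterns are captured once at the switch point, and every
later step reuses those frozen patterns. The helper names
(`select_head_patterns`, `model.project_qkv`, `model.denoise_step`) and the
`keep_ratio` / `full_steps` parameters are illustrative assumptions; a real
deployment would use a sparse attention kernel rather than masking a dense
score matrix.

```python
import torch

def select_head_patterns(q, k, keep_ratio=0.1):
    # q, k: (heads, L, d). Keep, per head and per query, the keys with the
    # largest attention scores; everything else is dropped in later steps.
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5      # (heads, L, L)
    k_keep = max(1, int(keep_ratio * scores.shape[-1]))
    topk = scores.topk(k_keep, dim=-1).indices
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask.scatter_(-1, topk, True)
    return mask                                                # boolean (heads, L, L)

def attention(q, k, v, mask=None):
    # Dense reference attention; a sparse kernel would skip masked pairs
    # instead of materializing the full score matrix as done here.
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    return scores.softmax(dim=-1) @ v

def sparsed_denoise(model, x, num_steps=1024, full_steps=64, keep_ratio=0.1):
    # Full attention for the first `full_steps` steps (early steps are critical
    # for quality); the head-specific patterns computed at the switch point are
    # then reused unchanged for every remaining denoising step.
    sparse_masks = None
    for step in range(num_steps):
        q, k, v = model.project_qkv(x)         # hypothetical: (heads, L, d) each
        if step < full_steps:
            out = attention(q, k, v)
            if step == full_steps - 1:
                sparse_masks = select_head_patterns(q, k, keep_ratio)
        else:
            out = attention(q, k, v, mask=sparse_masks)
        x = model.denoise_step(x, out, step)   # hypothetical denoising update
    return x
```

The point of the sketch is the control flow rather than the kernel: the sparse
pattern is computed exactly once per head and never re-estimated, and the
quality-critical early steps still see full attention before the switch.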