SparseD: Sparse Attention for Diffusion Language Models

Abstract

While diffusion language models (DLMs) offer a promising alternative to autoregressive models (ARs), existing open-source DLMs suffer from high inference latency. This bottleneck stems mainly from attention, whose cost grows quadratically with context length because every query-key pair is computed. A natural way to reduce this cost is to restrict attention to sparse patterns that retain only the most relevant connections. Such approaches are well established in ARs, where attention follows fixed and clearly defined sparse patterns. In DLMs, however, we observe distinct sparsity behaviors: (1) attention patterns vary across heads, (2) attention patterns in each head remain highly similar across denoising steps, and (3) early denoising steps are critical for generation. These findings render sparse attention methods designed for ARs largely incompatible with DLMs, as they fail to capture head-specific structures and risk degrading generation when applied in early denoising steps. To address these challenges, we propose SparseD, a novel sparse attention method for DLMs. Leveraging these observations, SparseD pre-computes head-specific sparse patterns only once and reuses them across all steps, which avoids recomputing sparse patterns at each denoising step. Meanwhile, SparseD uses full attention in the early steps and switches to sparse attention later to maintain generation quality. Together, these choices establish SparseD as a practical and efficient solution for deploying DLMs in long-context applications. Experimental results demonstrate that SparseD achieves lossless acceleration, delivering up to a 1.50× speedup over FlashAttention at a 64k context length with 1,024 denoising steps.
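The two ideas in the abstract (full attention in the early denoising steps, then a head-specific sparse pattern that is computed once and reused) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the names `SparseSchedule`, `topk_mask`, `full_steps`, and `keep_ratio` are hypothetical, the per-query top-k selection rule is an assumption, and so is the choice to build each head's pattern at the first sparse step.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def scores(q, k):
    """Scaled dot-product logits for one head (rows: queries, columns: keys)."""
    scale = 1.0 / math.sqrt(len(q[0]))
    return [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]


def topk_mask(logits, keep_ratio):
    """Assumed selection rule: each query keeps its highest-scoring keys."""
    n_keep = max(1, int(len(logits[0]) * keep_ratio))
    mask = []
    for row in logits:
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        mask.append(sorted(ranked[:n_keep]))
    return mask


def attend(q, k, v, mask=None):
    """Full attention if mask is None; otherwise score only the kept keys."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        cols = list(range(len(k))) if mask is None else mask[i]
        logits = [scale * sum(a * b for a, b in zip(qi, k[j])) for j in cols]
        weights = softmax(logits)
        out.append([sum(w * v[j][d] for w, j in zip(weights, cols))
                    for d in range(len(v[0]))])
    return out


class SparseSchedule:
    """Hypothetical driver: full attention early, cached sparse patterns later."""

    def __init__(self, full_steps, keep_ratio):
        self.full_steps = full_steps
        self.keep_ratio = keep_ratio
        self.masks = {}                # head index -> pattern, built once
        self.pattern_computations = 0  # counts how often a pattern is built

    def step(self, t, head, q, k, v):
        if t < self.full_steps:
            return attend(q, k, v)     # early steps: full attention
        if head not in self.masks:     # first sparse step for this head
            self.masks[head] = topk_mask(scores(q, k), self.keep_ratio)
            self.pattern_computations += 1
        return attend(q, k, v, self.masks[head])


# Toy usage: 3 queries, 4 keys, 2 heads sharing the same inputs for brevity.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, -1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
schedule = SparseSchedule(full_steps=2, keep_ratio=0.5)
for t in range(5):
    for head in range(2):
        output = schedule.step(t, head, q, k, v)
```

In this toy loop each head's pattern is built exactly once, no matter how many sparse steps follow. A practical version would run inside a fused attention kernel rather than Python loops, so the sketch says nothing about the reported speedup.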
