SparseD: Sparse Attention for Diffusion Language Models

Abstract

Diffusion language models (DLMs) offer a promising alternative to autoregressive models (ARs), but existing open-source DLMs suffer from high inference latency. The bottleneck comes mainly from attention, whose cost of computing all query-key pairs grows quadratically with context length. A natural way to reduce this cost is to restrict attention to sparse patterns that retain only the most relevant connections. Such approaches are well established in ARs, where attention follows fixed and clearly defined sparse patterns. In DLMs, however, we observe distinct sparsity behaviors: (1) attention patterns vary across heads, (2) attention patterns in each head remain highly similar across denoising steps, and (3) early denoising steps are critical for generation. These findings render sparse attention methods designed for ARs largely incompatible with DLMs, as they fail to capture head-specific structures and risk degrading generation when applied in early denoising steps. To address these challenges, we propose SparseD, a novel sparse attention method for DLMs. Leveraging these observations, SparseD pre-computes head-specific sparse patterns only once and reuses them across denoising steps instead of recomputing them at every step. Meanwhile, SparseD uses full attention in the early steps and switches to sparse attention later to maintain generation quality. Together, these design choices make SparseD a practical and efficient solution for deploying DLMs in long-context applications. Experimental results demonstrate that SparseD achieves lossless acceleration, delivering up to 1.50× speedup over FlashAttention at a 64k context length with 1,024 denoising steps.
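The schedule outlined in the abstract (full attention in the early denoising steps, a one-time computation of head-specific sparse patterns, then reuse of those patterns in later steps) can be illustrated with a toy Python sketch. This is not the authors' implementation. The names attend, top_keys, run_denoising, toy_heads, full_steps, and keep_ratio are hypothetical, and the selection rule used here (per-query top-k keys, taken at the last full-attention step) is an assumption, since the abstract does not say how or when the patterns are chosen.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, k, v, keep=None):
    """Single-head attention over [tokens][dim] lists.

    If `keep` is given, query i only scores the key indices in keep[i],
    so the work per query scales with len(keep[i]) instead of len(k).
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        cols = list(keep[i]) if keep is not None else list(range(len(k)))
        scores = [scale * sum(a * b for a, b in zip(qi, k[j])) for j in cols]
        weights = softmax(scores)
        out.append([sum(w * v[j][d] for w, j in zip(weights, cols))
                    for d in range(len(v[0]))])
    return out


def top_keys(q, k, keep_ratio):
    """Assumed selection rule: per query, keep the highest-scoring keys.

    Ranking uses unscaled dot products; the scale factor does not change order.
    """
    n_keep = max(1, math.ceil(keep_ratio * len(k)))
    pattern = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) for kj in k]
        ranked = sorted(range(len(k)), key=lambda j: scores[j], reverse=True)
        pattern.append(sorted(ranked[:n_keep]))
    return pattern


def run_denoising(heads_at, num_steps, full_steps, keep_ratio):
    """heads_at(step) returns one (q, k, v) triple per head for that step."""
    if full_steps < 1:
        raise ValueError("need at least one full-attention step to build patterns")
    patterns, modes, outputs = None, [], []
    for step in range(num_steps):
        heads = heads_at(step)
        if step < full_steps:
            # Early steps: full attention for every head.
            outputs.append([attend(q, k, v) for q, k, v in heads])
            modes.append("full")
            if step == full_steps - 1:
                # Patterns are computed once, separately for each head.
                patterns = [top_keys(q, k, keep_ratio) for q, k, _ in heads]
        else:
            # Later steps: reuse the cached per-head patterns.
            outputs.append([attend(q, k, v, keep=p)
                            for (q, k, v), p in zip(heads, patterns)])
            modes.append("sparse")
    return modes, outputs, patterns


def toy_heads(step, n_heads=2, n_tokens=4, dim=3):
    """Hypothetical stand-in for a model's per-head q/k/v projections."""
    def matrix():
        return [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
    return [(matrix(), matrix(), matrix()) for _ in range(n_heads)]
```

Calling run_denoising(toy_heads, num_steps=5, full_steps=2, keep_ratio=0.5) runs two full-attention steps and three sparse steps that share one cached pattern per head. In a real DLM the savings would depend on a kernel that skips the unselected query-key pairs; this sketch only shows the control flow.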

