Trainable Log-linear Sparse Attention for Efficient Diffusion Transformers
December 18, 2025
Authors: Yifan Zhou, Zeqi Xiao, Tianyi Wei, Shuai Yang, Xingang Pan
cs.AI
Abstract
Diffusion Transformers (DiTs) set the state of the art in visual generation, yet their quadratic self-attention cost fundamentally limits scaling to long token sequences. Recent Top-K sparse attention approaches reduce the computation of DiTs by compressing tokens into block-wise representations and selecting a small set of relevant key blocks, but they still suffer from (i) a quadratic selection cost over the compressed tokens and (ii) a K that must keep growing with sequence length to maintain model quality. We trace this inefficiency to their single-level design: a single coarse level is insufficient to represent the global structure. In this paper, we introduce Log-linear Sparse Attention (LLSA), a trainable sparse attention mechanism for extremely long token sequences that reduces both selection and attention costs from quadratic to log-linear complexity by exploiting a hierarchical structure. LLSA performs hierarchical Top-K selection, progressively applying sparse Top-K selection guided by the indices found at the previous level, and introduces a Hierarchical KV Enrichment mechanism that preserves global context while attending over fewer tokens of different granularities. To support efficient training, we develop a high-performance GPU implementation that uses only sparse indices in both the forward and backward passes, eliminating the need for dense attention masks. We evaluate LLSA on high-resolution pixel-space image generation without patchification or VAE encoding. On 256x256 pixel token sequences, LLSA accelerates attention inference by 28.27x and DiT training by 6.09x while maintaining generation quality. These results show that LLSA offers a promising direction for training long-sequence DiTs efficiently. Code is available at: https://github.com/SingleZombie/LLSA
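To make the hierarchical Top-K selection described above concrete, here is a minimal PyTorch sketch for a single query vector. All names (`pool_keys`, `hierarchical_topk`, `block_size`, `k_per_level`) are illustrative assumptions, not the authors' API; the real LLSA implementation is a fused GPU kernel that also handles batching, attention heads, the backward pass, and the Hierarchical KV Enrichment step.

```python
# Minimal sketch of coarse-to-fine Top-K candidate selection (not the LLSA kernel).
import torch


def pool_keys(k, block_size):
    # Compress keys into block-wise representations by mean pooling.
    # k: (num_tokens, dim) -> (num_tokens // block_size, dim)
    n, d = k.shape
    return k.view(n // block_size, block_size, d).mean(dim=1)


def hierarchical_topk(q, k, block_size=4, levels=3, k_per_level=4):
    """Return candidate key indices for one query, refined coarse to fine.

    At the coarsest level, Top-K runs over all blocks; at every finer level the
    search is restricted to children of the blocks kept one level above, so the
    number of scored candidates per level stays O(k_per_level * block_size).
    """
    # Build a pyramid of pooled keys; pyramid[0] is the finest (token) level.
    pyramid = [k]
    for _ in range(levels):
        pyramid.append(pool_keys(pyramid[-1], block_size))

    # Start from every block at the coarsest level.
    candidates = torch.arange(pyramid[-1].shape[0])
    for level in range(levels, 0, -1):
        keys = pyramid[level][candidates]                  # (num_candidates, dim)
        scores = keys @ q                                  # similarity to the query
        top = scores.topk(min(k_per_level, scores.numel())).indices
        selected = candidates[top]                         # blocks kept at this level
        # Expand each kept block into its children at the next finer level.
        children = selected[:, None] * block_size + torch.arange(block_size)
        candidates = children.reshape(-1)
    return candidates                                      # token indices to attend to


if __name__ == "__main__":
    dim, block_size, levels = 64, 4, 3
    num_tokens = block_size ** (levels + 1)   # 256 tokens keeps the pyramid exact
    q = torch.randn(dim)
    k = torch.randn(num_tokens, dim)
    idx = hierarchical_topk(q, k, block_size, levels)
    print(idx.shape)  # a handful of token indices instead of all num_tokens
```

With block size B and K blocks kept per level, each of the O(log_B N) levels scores only about K·B candidates, which is where the log-linear selection cost claimed in the abstract comes from; the full method additionally mixes coarse pooled tokens into the keys and values (Hierarchical KV Enrichment) so that unselected regions still contribute global context during attention.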