SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention

September 28, 2025
Authors: Jintao Zhang, Haoxu Wang, Kai Jiang, Shuo Yang, Kaiwen Zheng, Haocheng Xi, Ziteng Wang, Hongzhou Zhu, Min Zhao, Ion Stoica, Joseph E. Gonzalez, Jun Zhu, Jianfei Chen
cs.AI

Abstract

In Diffusion Transformer (DiT) models, particularly for video generation, attention latency is a major bottleneck due to the long sequence length and the quadratic complexity. We find that attention weights can be separated into two parts: a small fraction of large weights with high rank and the remaining weights with very low rank. This naturally suggests applying sparse acceleration to the first part and low-rank acceleration to the second. Based on this finding, we propose SLA (Sparse-Linear Attention), a trainable attention method that fuses sparse and linear attention to accelerate diffusion models. SLA classifies attention weights into critical, marginal, and negligible categories, applying O(N^2) attention to critical weights, O(N) attention to marginal weights, and skipping negligible ones. SLA combines these computations into a single GPU kernel and supports both forward and backward passes. With only a few fine-tuning steps using SLA, DiT models achieve a 20x reduction in attention computation, resulting in significant acceleration without loss of generation quality. Experiments show that SLA reduces attention computation by 95% without degrading end-to-end generation quality, outperforming baseline methods. In addition, we implement an efficient GPU kernel for SLA, which yields a 13.7x speedup in attention computation and a 2.2x end-to-end speedup in video generation on Wan2.1-1.3B.
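
To make the three-way split concrete, the following is a minimal, non-fused PyTorch sketch of the decomposition the abstract describes. It is an illustration under assumptions only: the block size, the ranking-by-attention-mass classification, the elu+1 feature map, and the crit_frac/neg_frac thresholds are not taken from the paper, and the full score matrix is materialized purely for readability rather than computed with SLA's trainable classifier and fused O(N^2)/O(N) GPU kernel.

import torch
import torch.nn.functional as F

def sparse_linear_attention_reference(q, k, v, block=64, crit_frac=0.05, neg_frac=0.5):
    """q, k, v: (seq_len, head_dim) with seq_len divisible by `block`.
    Critical tiles get exact softmax attention, marginal tiles get a
    linear-attention approximation, and negligible tiles are skipped."""
    n, d = q.shape
    probs = ((q @ k.T) / d ** 0.5).softmax(dim=-1)         # full weights, reference only

    # Pool attention mass per (query-block, key-block) tile and rank the tiles.
    nb = n // block
    tile_mass = probs.reshape(nb, block, nb, block).sum(dim=(1, 3))
    order = tile_mass.flatten().argsort(descending=True)
    n_tiles = nb * nb
    n_crit = max(1, int(crit_frac * n_tiles))
    n_neg = int(neg_frac * n_tiles)
    label = torch.ones(n_tiles, dtype=torch.long)           # 1 = marginal
    label[order[:n_crit]] = 0                               # 0 = critical
    label[order[n_tiles - n_neg:]] = 2                      # 2 = negligible
    mask = label.reshape(nb, nb).repeat_interleave(block, 0).repeat_interleave(block, 1)

    # Critical part: exact attention restricted to critical tiles.
    out = (probs * (mask == 0)) @ v

    # Marginal part: a normalized phi(q) phi(k)^T kernel as the low-rank stand-in,
    # restricted to marginal tiles (SLA computes this contribution in O(N)).
    phi_q, phi_k = F.elu(q) + 1, F.elu(k) + 1
    lin = phi_q @ phi_k.T
    lin = lin / lin.sum(dim=-1, keepdim=True).clamp_min(1e-6)
    return out + (lin * (mask == 1)) @ v                    # negligible tiles add nothing

q, k, v = (torch.randn(256, 64) for _ in range(3))
print(sparse_linear_attention_reference(q, k, v).shape)     # torch.Size([256, 64])

Note that the critical and marginal contributions do not sum exactly to full attention; per the abstract, the brief fine-tuning step is what lets the model absorb this approximation without losing generation quality.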