SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention
September 28, 2025
Authors: Jintao Zhang, Haoxu Wang, Kai Jiang, Shuo Yang, Kaiwen Zheng, Haocheng Xi, Ziteng Wang, Hongzhou Zhu, Min Zhao, Ion Stoica, Joseph E. Gonzalez, Jun Zhu, Jianfei Chen
cs.AI
Abstract
In Diffusion Transformer (DiT) models, particularly for video generation,
attention latency is a major bottleneck due to the long sequence length and the
quadratic complexity. We find that attention weights can be separated into two
parts: a small fraction of large weights with high rank and the remaining
weights with very low rank. This naturally suggests applying sparse
acceleration to the first part and low-rank acceleration to the second. Based
on this finding, we propose SLA (Sparse-Linear Attention), a trainable
attention method that fuses sparse and linear attention to accelerate diffusion
models. SLA classifies attention weights into critical, marginal, and
negligible categories, applying O(N^2) attention to critical weights, O(N)
attention to marginal weights, and skipping negligible ones. SLA combines these
computations into a single GPU kernel and supports both forward and backward
passes. With only a few fine-tuning steps using SLA, DiT models achieve a 20x
reduction in attention computation, resulting in significant acceleration
without loss of generation quality. Experiments show that SLA reduces attention
computation by 95% without degrading end-to-end generation quality,
outperforming baseline methods. In addition, we implement an efficient GPU
kernel for SLA, which yields a 13.7x speedup in attention computation and a
2.2x end-to-end speedup in video generation on Wan2.1-1.3B.
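As a rough illustration of the split the abstract describes — exact attention on critical weights, linear attention on marginal weights, and negligible weights skipped — the following is a minimal PyTorch sketch. The block size, the top-k block classification rule, the ELU feature map, and the simple summation of the two branch outputs are all illustrative assumptions, not the paper's actual fused GPU kernel or fine-tuning recipe.

```python
# Minimal, illustrative sketch of a sparse + linear attention split.
# All heuristics below (block pooling, top-k selection, elu feature map,
# summing the two branches) are assumptions for illustration only.
import torch
import torch.nn.functional as F

def sparse_linear_attention(q, k, v, block=64, keep_ratio=0.05):
    """q, k, v: [B, H, N, D] with N divisible by `block`."""
    B, H, N, D = q.shape
    nb = N // block
    scale = D ** -0.5

    # 1) Classify key blocks per query block with a coarse block-level score
    #    (mean-pooled q/k); the top `keep_ratio` blocks are treated as critical.
    qb = q.view(B, H, nb, block, D).mean(dim=3)
    kb = k.view(B, H, nb, block, D).mean(dim=3)
    block_scores = qb @ kb.transpose(-1, -2) * scale               # [B, H, nb, nb]
    k_keep = max(1, int(keep_ratio * nb))
    idx = block_scores.topk(k_keep, dim=-1).indices
    critical = torch.zeros_like(block_scores, dtype=torch.bool)
    critical.scatter_(-1, idx, True)                               # critical block mask

    # 2) Sparse branch: exact O(N^2) softmax attention restricted to critical blocks.
    token_mask = critical.repeat_interleave(block, dim=2).repeat_interleave(block, dim=3)
    logits = (q @ k.transpose(-1, -2)) * scale
    logits = logits.masked_fill(~token_mask, float("-inf"))
    sparse_out = logits.softmax(dim=-1) @ v                        # [B, H, N, D]

    # 3) Linear branch: O(N) kernelized attention over the remaining (marginal)
    #    key blocks, using phi(x) = elu(x) + 1 as an assumed feature map.
    phi_q = (F.elu(q) + 1).view(B, H, nb, block, D)
    phi_k = (F.elu(k) + 1).view(B, H, nb, block, D)
    v_blk = v.view(B, H, nb, block, D)
    kv_blk = torch.einsum("bhjmd,bhjme->bhjde", phi_k, v_blk)      # per-block K^T V
    ks_blk = phi_k.sum(dim=3)                                      # per-block key sums
    marginal = (~critical).float()
    kv = torch.einsum("bhij,bhjde->bhide", marginal, kv_blk)       # aggregated per query block
    ks = torch.einsum("bhij,bhjd->bhid", marginal, ks_blk)
    num = torch.einsum("bhimd,bhide->bhime", phi_q, kv)
    den = torch.einsum("bhimd,bhid->bhim", phi_q, ks).clamp(min=1e-6)
    linear_out = (num / den.unsqueeze(-1)).reshape(B, H, N, D)

    # 4) Combine the two branches (a plain sum here; SLA instead fine-tunes the
    #    model so the fused sparse-linear attention preserves generation quality).
    return sparse_out + linear_out
```

In this sketch only the block-selection step touches all N x N block pairs at coarse granularity; the exact-attention cost is confined to the roughly 5% of blocks marked critical, while the rest is handled by the O(N) linear branch, which is the source of the computation reduction the abstract reports.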