SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention
September 28, 2025
Authors: Jintao Zhang, Haoxu Wang, Kai Jiang, Shuo Yang, Kaiwen Zheng, Haocheng Xi, Ziteng Wang, Hongzhou Zhu, Min Zhao, Ion Stoica, Joseph E. Gonzalez, Jun Zhu, Jianfei Chen
cs.AI
Abstract
In Diffusion Transformer (DiT) models, particularly for video generation,
attention latency is a major bottleneck due to the long sequence length and the
quadratic complexity. We find that attention weights can be separated into two
parts: a small fraction of large weights with high rank and the remaining
weights with very low rank. This naturally suggests applying sparse
acceleration to the first part and low-rank acceleration to the second. Based
on this finding, we propose SLA (Sparse-Linear Attention), a trainable
attention method that fuses sparse and linear attention to accelerate diffusion
models. SLA classifies attention weights into critical, marginal, and
negligible categories, applying O(N^2) attention to critical weights, O(N)
attention to marginal weights, and skipping negligible ones. SLA combines these
computations into a single GPU kernel and supports both forward and backward
passes. With only a few fine-tuning steps using SLA, DiT models achieve a 20x
reduction in attention computation, resulting in significant acceleration
without loss of generation quality. Experiments show that SLA reduces attention
computation by 95% without degrading end-to-end generation quality,
outperforming baseline methods. In addition, we implement an efficient GPU
kernel for SLA, which yields a 13.7x speedup in attention computation and a
2.2x end-to-end speedup in video generation on Wan2.1-1.3B.
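As a rough illustration of the split the abstract describes — exact attention on critical weights, linear attention on marginal weights, and negligible weights skipped — the following is a minimal PyTorch sketch. The block size, the top-k block classification rule, the ELU feature map, and the simple summation of the two branch outputs are all illustrative assumptions, not the paper's actual fused GPU kernel or fine-tuning recipe.

```python
# Minimal, illustrative sketch of a sparse + linear attention split.
# All heuristics below (block pooling, top-k selection, elu feature map,
# summing the two branches) are assumptions for illustration only.
import torch
import torch.nn.functional as F

def sparse_linear_attention(q, k, v, block=64, keep_ratio=0.05):
    """q, k, v: [B, H, N, D] with N divisible by `block`."""
    B, H, N, D = q.shape
    nb = N // block
    scale = D ** -0.5

    # 1) Classify key blocks per query block with a coarse block-level score
    #    (mean-pooled q/k); the top `keep_ratio` blocks are treated as critical.
    qb = q.view(B, H, nb, block, D).mean(dim=3)
    kb = k.view(B, H, nb, block, D).mean(dim=3)
    block_scores = qb @ kb.transpose(-1, -2) * scale               # [B, H, nb, nb]
    k_keep = max(1, int(keep_ratio * nb))
    idx = block_scores.topk(k_keep, dim=-1).indices
    critical = torch.zeros_like(block_scores, dtype=torch.bool)
    critical.scatter_(-1, idx, True)                               # critical block mask

    # 2) Sparse branch: exact O(N^2) softmax attention restricted to critical blocks.
    token_mask = critical.repeat_interleave(block, dim=2).repeat_interleave(block, dim=3)
    logits = (q @ k.transpose(-1, -2)) * scale
    logits = logits.masked_fill(~token_mask, float("-inf"))
    sparse_out = logits.softmax(dim=-1) @ v                        # [B, H, N, D]

    # 3) Linear branch: O(N) kernelized attention over the remaining (marginal)
    #    key blocks, using phi(x) = elu(x) + 1 as an assumed feature map.
    phi_q = (F.elu(q) + 1).view(B, H, nb, block, D)
    phi_k = (F.elu(k) + 1).view(B, H, nb, block, D)
    v_blk = v.view(B, H, nb, block, D)
    kv_blk = torch.einsum("bhjmd,bhjme->bhjde", phi_k, v_blk)      # per-block K^T V
    ks_blk = phi_k.sum(dim=3)                                      # per-block key sums
    marginal = (~critical).float()
    kv = torch.einsum("bhij,bhjde->bhide", marginal, kv_blk)       # aggregated per query block
    ks = torch.einsum("bhij,bhjd->bhid", marginal, ks_blk)
    num = torch.einsum("bhimd,bhide->bhime", phi_q, kv)
    den = torch.einsum("bhimd,bhid->bhim", phi_q, ks).clamp(min=1e-6)
    linear_out = (num / den.unsqueeze(-1)).reshape(B, H, N, D)

    # 4) Combine the two branches (a plain sum here; SLA instead fine-tunes the
    #    model so the fused sparse-linear attention preserves generation quality).
    return sparse_out + linear_out
```

In this sketch only the block-selection step touches all N x N block pairs at coarse granularity; the exact-attention cost is confined to the roughly 5% of blocks marked critical, while the rest is handled by the O(N) linear branch, which is the source of the computation reduction the abstract reports.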