SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention

Abstract

In Diffusion Transformer (DiT) models, particularly those for video generation, attention latency is a major bottleneck because of long sequence lengths and the quadratic complexity of attention. We find that attention weights can be separated into two parts: a small fraction of large weights with high rank, and the remaining weights with very low rank. This naturally suggests applying sparse acceleration to the first part and low-rank acceleration to the second. Based on this finding, we propose SLA (Sparse-Linear Attention), a trainable attention method that fuses sparse and linear attention to accelerate diffusion models. SLA classifies attention weights into critical, marginal, and negligible categories, applying O(N^2) attention to critical weights and O(N) attention to marginal weights, and skipping negligible ones. SLA combines these computations into a single GPU kernel and supports both forward and backward passes. With only a few fine-tuning steps using SLA, DiT models achieve a 20x reduction in attention computation, yielding significant acceleration without loss of generation quality. Experiments show that SLA reduces attention computation by 95% without degrading end-to-end generation quality, outperforming baseline methods. In addition, we implement an efficient GPU kernel for SLA, which yields a 13.7x speedup in attention computation and a 2.2x end-to-end speedup in video generation on Wan2.1-1.3B.
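The three-way split described in the abstract can be illustrated with a small pure-Python sketch. It is not the fused GPU kernel and makes several assumptions the abstract does not specify: blocks are classified by ranking mean-pooled query/key scores, the linear branch uses an ELU+1 feature map, and the sparse and linear outputs are simply summed. All function and parameter names (`classify_blocks`, `block_summaries`, `sparse_linear_attention`, `n_critical`, `n_negligible`) are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_rows(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def phi(x):
    # Assumed positive feature map (ELU + 1) for the linear branch.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def classify_blocks(Q, K, block, n_critical, n_negligible):
    """Label each (query block, key block) pair: 2 critical, 1 marginal, 0 negligible.
    Assumption: rank key blocks by softmax over mean-pooled block scores."""
    d = len(Q[0])
    qb = [mean_rows(Q[i:i + block]) for i in range(0, len(Q), block)]
    kb = [mean_rows(K[j:j + block]) for j in range(0, len(K), block)]
    labels = []
    for q in qb:
        p = softmax([dot(q, k) / math.sqrt(d) for k in kb])
        order = sorted(range(len(kb)), key=lambda j: p[j], reverse=True)
        row = [1] * len(kb)
        for j in order[:n_critical]:
            row[j] = 2
        for j in order[len(kb) - n_negligible:]:
            row[j] = 0
        labels.append(row)
    return labels

def block_summaries(K, V, block):
    """Per key block: H = sum of phi(k) v^T (d x dv) and Z = sum of phi(k) (d).
    Computed once, so the linear branch never forms per-token scores."""
    d, dv = len(K[0]), len(V[0])
    summaries = []
    for s in range(0, len(K), block):
        H = [[0.0] * dv for _ in range(d)]
        Z = [0.0] * d
        for k, v in zip(K[s:s + block], V[s:s + block]):
            fk = phi(k)
            for a in range(d):
                Z[a] += fk[a]
                for c in range(dv):
                    H[a][c] += fk[a] * v[c]
        summaries.append((H, Z))
    return summaries

def sparse_linear_attention(Q, K, V, block=2, n_critical=1, n_negligible=1):
    d, dv = len(Q[0]), len(V[0])
    labels = classify_blocks(Q, K, block, n_critical, n_negligible)
    summaries = block_summaries(K, V, block)
    out = []
    for i, q in enumerate(Q):
        row = labels[i // block]
        o = [0.0] * dv
        # Critical blocks: exact softmax attention over the selected keys only.
        crit = [j for j in range(len(K)) if row[j // block] == 2]
        if crit:
            w = softmax([dot(q, K[j]) / math.sqrt(d) for j in crit])
            for wj, j in zip(w, crit):
                for c in range(dv):
                    o[c] += wj * V[j][c]
        # Marginal blocks: linear attention from the precomputed summaries.
        marg = [j for j, lab in enumerate(row) if lab == 1]
        if marg:
            fq = phi(q)
            z = sum(dot(fq, summaries[j][1]) for j in marg)
            for j in marg:
                H = summaries[j][0]
                for c in range(dv):
                    o[c] += sum(fq[a] * H[a][c] for a in range(d)) / z
        # Negligible blocks (label 0) are skipped entirely.
        out.append(o)
    return out
```

In this sketch, setting `n_critical` to the number of key blocks and `n_negligible` to 0 recovers ordinary softmax attention, while setting both to 0 gives plain linear attention. The method described in the abstract is trainable and fine-tuned; this toy version has no learned parameters.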
