SLA2: Sparse-Linear Attention with Learnable Routing and QAT

Abstract

Sparse-Linear Attention (SLA) combines sparse and linear attention to accelerate diffusion models and has shown strong performance in video generation. However, SLA has two limitations: (i) it relies on a heuristic split that assigns each computation to the sparse or linear branch based on attention-weight magnitude, which can be suboptimal; and (ii) a formal analysis of its attention error reveals a mismatch between SLA and a direct decomposition into sparse and linear attention. We propose SLA2, which introduces (I) a learnable router that dynamically selects whether each attention computation should use sparse or linear attention, (II) a more faithful and direct sparse-linear attention formulation that uses a learnable ratio to combine the sparse and linear attention branches, and (III) a sparse + low-bit attention design, in which low-bit attention is introduced via quantization-aware fine-tuning to reduce quantization error. Experiments show that on video diffusion models, SLA2 can achieve 97% attention sparsity and deliver an 18.6x attention speedup while preserving generation quality.
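To make the idea of routing plus a learnable mixing ratio concrete, the following is a minimal, dependency-free sketch and not the SLA2 implementation. It makes several assumptions that the abstract does not specify: routing is done per token rather than per block, the hypothetical router scores keys with a per-dimension weight vector `router_w` and keeps the top `k`, the linear branch uses the common `elu(x) + 1` feature map, and the two branches are mixed as `alpha * sparse + (1 - alpha) * linear` with `alpha = sigmoid(alpha_logit)`. All function and parameter names are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def phi(x):
    # elu(x) + 1 feature map: strictly positive, a common choice for linear attention
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def weighted_avg(weights, rows, dim):
    # Normalized weighted sum of value rows; zeros if no key was assigned
    total = sum(weights)
    if total == 0.0:
        return [0.0] * dim
    return [sum(w * r[c] for w, r in zip(weights, rows)) / total for c in range(dim)]

def sparse_linear_attention(Q, K, V, router_w, alpha_logit, k):
    """Toy forward pass; assumes 1 <= k <= len(K). Not the paper's kernel."""
    d, dv, n = len(Q[0]), len(V[0]), len(K)
    scale = 1.0 / math.sqrt(d)
    alpha = 1.0 / (1.0 + math.exp(-alpha_logit))  # learnable ratio in (0, 1)
    out = []
    for q in Q:
        # Hypothetical router: score every key, send the top-k to the sparse branch
        gated_q = [qi * w for qi, w in zip(q, router_w)]
        route = [dot(gated_q, kj) for kj in K]
        top = sorted(range(n), key=lambda j: route[j], reverse=True)[:k]
        rest = [j for j in range(n) if j not in top]

        # Sparse branch: exact softmax attention restricted to the routed keys
        logits = [dot(q, K[j]) * scale for j in top]
        m = max(logits)
        sparse_w = [math.exp(l - m) for l in logits]
        o_sparse = weighted_avg(sparse_w, [V[j] for j in top], dv)

        # Linear branch: kernelized attention over the remaining keys
        # (written explicitly here; a real kernel would precompute sum phi(k) v^T)
        fq = phi(q)
        linear_w = [dot(fq, phi(K[j])) for j in rest]
        o_linear = weighted_avg(linear_w, [V[j] for j in rest], dv)

        # Combine the two branches with the learnable ratio
        out.append([alpha * s + (1.0 - alpha) * l for s, l in zip(o_sparse, o_linear)])
    return out
```

In this sketch, setting `k = len(K)` and a large `alpha_logit` reduces the output to ordinary softmax attention, which is a convenient sanity check. Top-k selection is not differentiable, so training a router like this one would need a relaxation or surrogate that the sketch does not cover.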
