SLA2: Sparse-Linear Attention with Learnable Routing and QAT

Abstract

Sparse-Linear Attention (SLA) combines sparse and linear attention to accelerate diffusion models and has shown strong performance in video generation. However, (i) SLA relies on a heuristic split that assigns each computation to the sparse or the linear branch based on attention-weight magnitude, which can be suboptimal. Additionally, (ii) a formal analysis of the attention error in SLA reveals a mismatch between SLA and a direct decomposition into sparse and linear attention. We propose SLA2, which introduces (I) a learnable router that dynamically selects whether each attention computation should use sparse or linear attention, (II) a more faithful and direct sparse-linear attention formulation that uses a learnable ratio to combine the sparse and linear attention branches, and (III) a sparse + low-bit attention design, in which low-bit attention is introduced via quantization-aware fine-tuning to reduce quantization error. Experiments show that on video diffusion models, SLA2 can achieve 97% attention sparsity and deliver an 18.6× attention speedup while preserving generation quality.
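To make the idea of routing plus a mixing ratio concrete, here is a minimal pure-Python sketch. It is not the SLA2 implementation: the router form (`route`, `router_w`), the token-level `keep` budget, the `elu(x) + 1` feature map, and the scalar `alpha` are all assumptions chosen for illustration, and the low-bit (quantization-aware) part is omitted entirely.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def feature_map(x):
    # elu(x) + 1: a common positive feature map for linear attention (assumed here).
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def weighted_sum(weights, rows, width):
    # Sum of weights[i] * rows[i], returned as a vector of length `width`.
    return [sum(w * r[c] for w, r in zip(weights, rows)) for c in range(width)]


def route(q, K, router_w, keep):
    # Hypothetical router: score key j by sum_c router_w[c] * q[c] * K[j][c]
    # and send the `keep` highest-scoring keys to the sparse branch.
    scores = [sum(w * a * b for w, a, b in zip(router_w, q, k)) for k in K]
    ranked = sorted(range(len(K)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:keep])


def sparse_linear_attention(Q, K, V, router_w, alpha, keep):
    d, dv = len(Q[0]), len(V[0])
    out = []
    for q in Q:
        sel = route(q, K, router_w, keep)
        rest = [j for j in range(len(K)) if j not in sel]

        # Sparse branch: exact softmax attention restricted to the routed keys.
        o_sparse = [0.0] * dv
        if sel:
            p = softmax([dot(q, K[j]) / math.sqrt(d) for j in sel])
            o_sparse = weighted_sum(p, [V[j] for j in sel], dv)

        # Linear branch: kernelized attention over the remaining keys.
        # Pairwise weights are formed explicitly for readability; a fast
        # kernel would avoid materializing them.
        o_linear = [0.0] * dv
        if rest:
            fq = feature_map(q)
            w = [dot(fq, feature_map(K[j])) for j in rest]
            z = sum(w)
            o_linear = weighted_sum([x / z for x in w], [V[j] for j in rest], dv)

        # Mixing ratio (a fixed scalar here; learnable in training) combines the branches.
        out.append([alpha * s + (1.0 - alpha) * l for s, l in zip(o_sparse, o_linear)])
    return out
```

In this sketch, setting `keep` to the number of keys and `alpha` to 1.0 recovers ordinary softmax attention, which gives a simple sanity check. In a trained model, `router_w` and `alpha` would be learned parameters rather than fixed inputs.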
