SLA2: Sparse-Linear Attention with Learnable Routing and QAT

Abstract

Sparse-Linear Attention (SLA) combines sparse and linear attention to accelerate diffusion models and has shown strong performance in video generation. However, (i) SLA relies on a heuristic split that assigns computations to the sparse or linear branch based on attention-weight magnitude, which can be suboptimal. Additionally, (ii) a formal analysis of the attention error in SLA reveals a mismatch between SLA and a direct decomposition into sparse and linear attention. We propose SLA2, which introduces (I) a learnable router that dynamically selects whether each attention computation should use sparse or linear attention, (II) a more faithful and direct sparse-linear attention formulation that uses a learnable ratio to combine the sparse and linear attention branches, and (III) a sparse + low-bit attention design, in which low-bit attention is introduced via quantization-aware fine-tuning to reduce quantization error. Experiments show that on video diffusion models, SLA2 can achieve 97% attention sparsity and deliver an 18.6x attention speedup while preserving generation quality.
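
To make the idea of combining a sparse and a linear branch with a ratio more concrete, here is a minimal NumPy sketch. It is not the SLA2 formulation, which the abstract does not spell out. Everything in it is an assumption made for illustration: the function names, the top-k rule that stands in for the learnable router, the `elu(x) + 1` feature map in the linear branch, and the fixed scalar `alpha` that stands in for the learnable ratio.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax; entries set to -inf receive weight 0."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def sparse_branch(q, k, v, keep):
    """Softmax attention restricted to the `keep` highest-scoring keys per query.

    Hypothetical stand-in for a router: a fixed top-k rule, not a learned one.
    Exact ties at the threshold would keep slightly more than `keep` keys.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Per-row threshold: the keep-th largest score.
    thresh = np.sort(scores, axis=-1)[:, -keep][:, None]
    masked = np.where(scores >= thresh, scores, -np.inf)
    return softmax(masked) @ v


def linear_branch(q, k, v):
    """Kernel-feature linear attention with phi(x) = elu(x) + 1 (assumed map)."""
    def phi(x):
        # elu(x) + 1 equals x + 1 for x > 0 and exp(x) otherwise; always positive.
        return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

    qf, kf = phi(q), phi(k)
    kv = kf.T @ v              # (d, d_v) summary, built in one pass over the keys
    z = qf @ kf.sum(axis=0)    # (n,) per-query normalizer
    return (qf @ kv) / z[:, None]


def mixed_attention(q, k, v, keep, alpha):
    """Convex combination of the two branches; alpha would be learned in practice."""
    sparse_out = sparse_branch(q, k, v, keep)
    linear_out = linear_branch(q, k, v)
    return alpha * sparse_out + (1.0 - alpha) * linear_out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((8, 4)) for _ in range(3))
    out = mixed_attention(q, k, v, keep=2, alpha=0.7)
    print(out.shape)
```

With `alpha = 1.0` and `keep` equal to the number of keys, the sketch reduces to ordinary softmax attention; with `alpha = 0.0` it reduces to the linear branch alone. A real implementation would compute the sparse branch blockwise on the GPU rather than building the full score matrix as this toy version does.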
