Trainable Log-linear Sparse Attention for Efficient Diffusion Transformers

Abstract

Diffusion Transformers (DiTs) set the state of the art in visual generation, yet their quadratic self-attention cost fundamentally limits scaling to long token sequences. Recent Top-K sparse attention approaches reduce the computation of DiTs by compressing tokens into block-wise representations and selecting a small set of relevant key blocks, but they still suffer from (i) a quadratic selection cost on the compressed tokens and (ii) an increasing K required to maintain model quality as sequences grow. We identify the single-level design as the source of this inefficiency, since a single coarse level is insufficient to represent the global structure. In this paper, we introduce Log-linear Sparse Attention (LLSA), a trainable sparse attention mechanism for extremely long token sequences that reduces both selection and attention costs from quadratic to log-linear complexity by exploiting a hierarchical structure. LLSA performs hierarchical Top-K selection, progressively applying sparse Top-K selection using the indices found at the previous level, and introduces a Hierarchical KV Enrichment mechanism that preserves global context while using fewer tokens of different granularity during attention computation. To support efficient training, we develop a high-performance GPU implementation that uses only sparse indices for both the forward and backward passes, eliminating the need for dense attention masks. We evaluate LLSA on high-resolution pixel-space image generation without patchification or VAE encoding. LLSA accelerates attention inference by 28.27x and DiT training by 6.09x on 256x256 pixel token sequences while maintaining generation quality. The results demonstrate that LLSA offers a promising direction for training long-sequence DiTs efficiently. Code is available at: https://github.com/SingleZombie/LLSA
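
To make the hierarchical Top-K selection described above more concrete, the following is a minimal NumPy sketch, not the implementation in the LLSA repository. It assumes that coarse representations are obtained by mean pooling over blocks of size `b` and that relevance is a plain dot product. The function names `pool` and `hierarchical_topk` and all parameters are hypothetical. Dense scoring happens only at the coarsest level. At each finer level, a query block scores only the children of the key blocks selected by its parent.

```python
import numpy as np

def pool(x, b):
    """Mean-pool consecutive groups of b rows: (n, d) -> (n // b, d).
    Mean pooling is an assumption made for this sketch."""
    n, d = x.shape
    return x.reshape(n // b, b, d).mean(axis=1)

def hierarchical_topk(q, k, b=4, levels=2, top_k=2):
    """Select top_k key blocks for every finest-level query block.

    q, k: arrays of shape (n, d), with n divisible by b ** levels.
    Returns an integer array of shape (n // b, top_k).
    """
    # Build the pyramid; index 0 is the finest block level.
    qs, ks = [pool(q, b)], [pool(k, b)]
    for _ in range(levels - 1):
        qs.append(pool(qs[-1], b))
        ks.append(pool(ks[-1], b))

    # Coarsest level: dense scoring, but over very few entries.
    scores = qs[-1] @ ks[-1].T
    idx = np.argsort(-scores, axis=1)[:, :top_k]

    # Finer levels: score only the children of the parent's selection.
    for lvl in range(levels - 2, -1, -1):
        parent = np.repeat(idx, b, axis=0)            # row i is idx[i // b]
        cand = (parent[:, :, None] * b + np.arange(b)).reshape(len(parent), -1)
        s = np.einsum("nd,ncd->nc", qs[lvl], ks[lvl][cand])
        pick = np.argsort(-s, axis=1)[:, :top_k]
        idx = np.take_along_axis(cand, pick, axis=1)
    return idx

rng = np.random.default_rng(0)
q = rng.standard_normal((64, 8))
k = rng.standard_normal((64, 8))
idx = hierarchical_topk(q, k, b=4, levels=2, top_k=2)
print(idx.shape)  # (16, 2)
```

In this sketch each query block scores only `top_k * b` candidates per finer level, and the number of levels grows logarithmically with the sequence length when `b` is fixed, which is the intuition behind the log-linear selection cost. The Hierarchical KV Enrichment step and the sparse GPU kernels are not modeled here.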
