PISA: Piecewise Sparse Attention Is Wiser for Efficient Diffusion Transformers

Abstract

Diffusion Transformers are fundamental to video and image generation, but their efficiency is bottlenecked by the quadratic complexity of attention. Block sparse attention accelerates computation by attending only to critical key-value blocks, but at high sparsity it degrades quality because it discards context. In this work, we discover that the attention scores of non-critical blocks exhibit distributional stability, so they can be approximated accurately and efficiently rather than discarded, an observation that is important for sparse attention design. Motivated by this insight, we propose PISA, a training-free Piecewise Sparse Attention that covers the full attention span with sub-quadratic complexity. Unlike the conventional keep-or-drop paradigm, which directly drops non-critical block information, PISA introduces an exact-or-approximate strategy: it keeps exact computation for critical blocks and efficiently approximates the remainder through block-wise Taylor expansion. This design allows PISA to serve as a faithful proxy for full attention, bridging the gap between speed and quality. Experimental results show that PISA achieves 1.91× and 2.57× speedups on Wan2.1-14B and Hunyuan-Video, respectively, while consistently maintaining the highest quality among sparse attention methods. Notably, even for image generation on FLUX, PISA achieves a 1.2× speedup without compromising visual quality. Code is available at: https://github.com/xie-lab-ml/piecewise-sparse-attention.
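To make the exact-or-approximate idea concrete, the following is a minimal NumPy sketch, not the released PISA kernel. It rests on several assumptions that the abstract does not specify: critical blocks are chosen per query by the score of the block's mean key, the expansion is a first-order Taylor expansion of the exponential around that mean key, and the sequence length is divisible by the block size. The names `piecewise_attention`, `block`, `top_k`, and `approximate` are hypothetical. The sketch illustrates accuracy only; it makes no attempt to reproduce the sub-quadratic cost or the reported speedups.

```python
import numpy as np

def full_attention(q, k, v):
    """Reference softmax attention for a single head."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p @ v) / p.sum(axis=-1, keepdims=True)

def piecewise_attention(q, k, v, block=16, top_k=2, approximate=True):
    """Exact attention on top_k blocks per query; the remaining blocks are
    either approximated (approximate=True) or dropped (approximate=False)."""
    n, d = k.shape
    nb = n // block                      # assumption: n is divisible by block
    kb = k.reshape(nb, block, d)
    vb = v.reshape(nb, block, -1)

    # Per-block statistics, computed once and shared by all queries.
    k_mean = kb.mean(axis=1)                                      # (nb, d)
    v_sum = vb.sum(axis=1)                                        # (nb, dv)
    # cov[b] = sum_j (k_j - k_mean_b) v_j^T, the first-order Taylor term.
    cov = np.einsum("bjd,bje->bde", kb - k_mean[:, None, :], vb)  # (nb, d, dv)

    out = np.empty((q.shape[0], vb.shape[-1]))
    for i, qi in enumerate(q):
        qs = qi / np.sqrt(d)
        centre = k_mean @ qs                      # score of each block's mean key
        critical = np.argsort(-centre)[:top_k]    # assumed selection rule
        exact = kb[critical] @ qs                 # (top_k, block) exact scores
        m = max(centre.max(), exact.max(initial=-np.inf))  # for stability

        p = np.exp(exact - m)
        num = np.einsum("bj,bje->e", p, vb[critical])
        den = p.sum()

        if approximate:
            rest = np.setdiff1d(np.arange(nb), critical)
            w = np.exp(centre[rest] - m)
            # exp(q.k_j) ~ exp(q.k_mean) * (1 + q.(k_j - k_mean)); summed over a
            # block, the first-order term cancels in the denominator.
            first = np.einsum("d,bde->be", qs, cov[rest])
            num = num + w @ (v_sum[rest] + first)
            den = den + block * w.sum()

        out[i] = num / den
    return out
```

Setting `approximate=False` gives the keep-or-drop baseline, so the two strategies can be compared against `full_attention` on the same inputs. When `top_k` equals the number of blocks, every block is computed exactly and the function reduces to full attention.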
