Re-ttention: Ultra Sparse Visual Generation via Attention Statistical Reshape

Abstract

Diffusion Transformers (DiT) have become the de facto model for generating high-quality visual content such as videos and images. A major bottleneck is the attention mechanism, whose complexity scales quadratically with resolution and video length. One logical way to lessen this burden is sparse attention, where only a subset of tokens or patches is included in the calculation. However, existing techniques fail to preserve visual quality at extremely high sparsity levels and may even incur non-negligible compute overhead. To address this concern, we propose Re-ttention, which implements attention at very high sparsity for visual generation models by leveraging the temporal redundancy of diffusion models to overcome the probabilistic normalization shift within the attention mechanism. Specifically, Re-ttention reshapes attention scores based on the prior softmax distribution history in order to preserve the visual quality of full quadratic attention at very high sparsity levels. Experimental results on T2V/T2I models such as CogVideoX and the PixArt DiTs demonstrate that Re-ttention requires as few as 3.1% of the tokens during inference, outperforming contemporary methods such as FastDiTAttn, Sparse VideoGen and MInference. Further, we measure latency to show that our method can attain over 45% end-to-end and over 92% self-attention latency reduction on an H100 GPU at negligible overhead cost. Code is available online: https://github.com/cccrrrccc/Re-ttention
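The "normalization shift" mentioned in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation. It assumes a single query row, random scores, a top-k token selection, and a simple rescaling by a denominator ratio cached from an earlier denoising step; the function names (`softmax_weights`, `sparse_softmax_weights`, `denominator_ratio`) are hypothetical. It shows only that a softmax over a subset of tokens inflates the kept weights, and that a ratio remembered from a previous, similar step can largely undo that inflation.

```python
import numpy as np

def softmax_weights(scores):
    """Numerically stable softmax over a 1-D score vector."""
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

def sparse_softmax_weights(scores, keep, ratio=1.0):
    """Softmax over the kept tokens only, optionally scaled by `ratio`.

    With ratio=1.0 this is plain sparse attention: the kept weights sum
    to 1 even though they only cover part of the full softmax mass.
    """
    weights = np.zeros_like(scores)
    weights[keep] = softmax_weights(scores[keep]) * ratio
    return weights

def denominator_ratio(scores, keep):
    """Fraction of the full softmax mass that falls on the kept tokens."""
    exp = np.exp(scores - scores.max())
    return exp[keep].sum() / exp.sum()

rng = np.random.default_rng(0)

# Assumption: scores change only slightly between adjacent denoising steps.
scores_prev = rng.normal(size=64)
scores_now = scores_prev + 0.01 * rng.normal(size=64)

# Assumption: keep the 8 highest-scoring tokens (12.5% of 64).
keep = np.argsort(scores_now)[-8:]

full = softmax_weights(scores_now)                 # dense reference
naive = sparse_softmax_weights(scores_now, keep)   # shifted normalization

# Ratio computed once on the earlier step and reused on the current one.
ratio = denominator_ratio(scores_prev, keep)
rescaled = sparse_softmax_weights(scores_now, keep, ratio)

naive_err = np.abs(naive[keep] - full[keep]).max()
rescaled_err = np.abs(rescaled[keep] - full[keep]).max()
print(f"max error on kept tokens, naive sparse: {naive_err:.4f}")
print(f"max error on kept tokens, rescaled:     {rescaled_err:.4f}")
```

In this toy setting the rescaled weights track the dense weights much more closely than the naive sparse ones, because the cached ratio is still a good estimate of the softmax mass that the dropped tokens would have taken. How Re-ttention actually maintains and applies its softmax statistics is described in the paper and the linked repository, not here.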
