Sparse-vDiT: Unleashing the Power of Sparse Attention to Accelerate Video Diffusion Transformers

Abstract

Although Diffusion Transformers (DiTs) have achieved breakthroughs in video generation, this long-sequence generation task remains limited by the quadratic complexity of attention mechanisms, which results in substantial inference latency. Through a detailed analysis of attention maps in the Video Diffusion Transformer (vDiT), we identify three recurring sparsity patterns: diagonal, multi-diagonal, and vertical-stripe structures. In addition, 3-6% of attention heads can be skipped entirely. Crucially, these patterns correlate strongly with layer depth and head position but depend only weakly on the input content. Building on these findings, we propose Sparse-vDiT, a sparsity acceleration framework for vDiT consisting of: 1) pattern-optimized sparse kernels that replace dense attention with computationally efficient implementations for each identified sparsity pattern; 2) an offline sparse diffusion search algorithm that selects the optimal sparse computation strategy per layer and head via hardware-aware cost modeling. Once the optimal configuration is determined, we fuse heads within the same layer that share the same attention strategy, which improves inference efficiency. Integrated into state-of-the-art vDiT models (CogVideoX1.5, HunyuanVideo, and Wan2.1), Sparse-vDiT achieves theoretical FLOP reductions of 2.09×, 2.38×, and 1.67×, respectively, and actual inference speedups of 1.76×, 1.85×, and 1.58×, while maintaining high visual quality, with PSNR values of 24.13, 27.09, and 22.59. Our work shows that latent structural sparsity in vDiTs can be systematically exploited for long video synthesis.

English

While Diffusion Transformers (DiTs) have achieved breakthroughs in video generation, this long-sequence generation task remains constrained by the quadratic complexity of attention mechanisms, resulting in significant inference latency. Through detailed analysis of attention maps in the Video Diffusion Transformer (vDiT), we identify three recurring sparsity patterns: diagonal, multi-diagonal, and vertical-stripe structures. Moreover, 3-6% of attention heads can be skipped. Crucially, these patterns exhibit strong layer-depth and head-position correlations but show limited dependence on the input content. Leveraging these findings, we propose Sparse-vDiT, a sparsity acceleration framework for vDiT comprising: 1) Pattern-optimized sparse kernels that replace dense attention with computationally efficient implementations for each identified sparsity pattern. 2) An offline sparse diffusion search algorithm that selects the optimal sparse computation strategy per layer and head via hardware-aware cost modeling. After determining the optimal configuration, we fuse heads within the same layer that share the same attention strategy, enhancing inference efficiency. Integrated into state-of-the-art vDiT models (CogVideoX1.5, HunyuanVideo, and Wan2.1), Sparse-vDiT achieves 2.09×, 2.38×, and 1.67× theoretical FLOP reduction, and actual inference speedups of 1.76×, 1.85×, and 1.58×, respectively, while maintaining high visual fidelity, with PSNR values reaching 24.13, 27.09, and 22.59. Our work demonstrates that latent structural sparsity in vDiTs can be systematically exploited for long video synthesis.
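
The three sparsity patterns named above can be pictured as boolean masks over the attention map. The following is a minimal NumPy sketch for illustration only, not the paper's kernels: the function names and the parameters `window`, `stride`, and `columns` are hypothetical, and the reading of `stride` as a fixed token offset (for example, the number of tokens per frame) is an assumption. The reference attention still computes the full score matrix, so it shows which entries each pattern keeps rather than delivering any speedup.

```python
import numpy as np


def diagonal_mask(n, window):
    """Keep keys within `window` positions of each query (a band on the main diagonal)."""
    i = np.arange(n)
    return np.abs(i[:, None] - i[None, :]) <= window


def multi_diagonal_mask(n, stride, window):
    """Keep bands around every diagonal whose offset is a multiple of `stride`.

    Assumption: `stride` stands for a fixed token offset, such as tokens per frame.
    """
    i = np.arange(n)
    offset = np.abs(i[:, None] - i[None, :]) % stride
    # Distance to the nearest multiple of `stride`, on either side.
    return np.minimum(offset, stride - offset) <= window


def vertical_stripe_mask(n, columns):
    """Keep a fixed set of key columns for every query."""
    mask = np.zeros((n, n), dtype=bool)
    mask[:, columns] = True
    return mask


def masked_attention(q, k, v, mask):
    """Dense reference attention restricted to `mask`.

    Every row of `mask` must keep at least one key, otherwise the softmax is undefined.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def density(mask):
    """Fraction of attention entries that the mask keeps."""
    return float(mask.mean())
```

Under these assumptions, `1 / density(mask)` gives a rough upper bound on the FLOP reduction of the score and value products for a single head. It is not a reproduction of the figures reported above, which depend on the per-layer, per-head strategy chosen by the offline search.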
