
Sparse-vDiT: Unleashing the Power of Sparse Attention to Accelerate Video Diffusion Transformers

June 3, 2025
Authors: Pengtao Chen, Xianfang Zeng, Maosen Zhao, Peng Ye, Mingzhu Shen, Wei Cheng, Gang Yu, Tao Chen
cs.AI

Abstract

While Diffusion Transformers (DiTs) have achieved breakthroughs in video generation, this long-sequence generation task remains constrained by the quadratic complexity of attention mechanisms, resulting in significant inference latency. Through detailed analysis of attention maps in Video Diffusion Transformers (vDiTs), we identify three recurring sparsity patterns: diagonal, multi-diagonal, and vertical-stripe structures. Moreover, 3-6% of attention heads can be skipped entirely. Crucially, these patterns exhibit strong layer-depth and head-position correlations but limited dependence on input content. Leveraging these findings, we propose Sparse-vDiT, a sparsity acceleration framework for vDiT comprising: 1) pattern-optimized sparse kernels that replace dense attention with computationally efficient implementations for each identified sparsity pattern; and 2) an offline sparse diffusion search algorithm that selects the optimal sparse computation strategy per layer and head via hardware-aware cost modeling. After determining the optimal configuration, we fuse heads within the same layer that share the same attention strategy, improving inference efficiency. Integrated into state-of-the-art vDiT models (CogVideoX1.5, HunyuanVideo, and Wan2.1), Sparse-vDiT achieves 2.09×, 2.38×, and 1.67× theoretical FLOP reduction, and actual inference speedups of 1.76×, 1.85×, and 1.58×, respectively, while maintaining high visual fidelity, with PSNR values reaching 24.13, 27.09, and 22.59. Our work demonstrates that latent structural sparsity in vDiTs can be systematically exploited for long video synthesis.
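To make the pattern-selection idea concrete, the sketch below is our own illustration (not the authors' released code): it builds boolean masks for the three sparsity patterns the abstract names and, for a given head's attention map, picks the cheapest pattern that still retains most of the attention mass. The parameters `bandwidth`, `period`, `stripe_stride`, and `tol` are hypothetical illustration values, and raw mask density stands in for the paper's hardware-aware cost model.

```python
import numpy as np

def diagonal_mask(n, bandwidth=64):
    # Keep only entries within `bandwidth` of the main diagonal.
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= bandwidth

def multi_diagonal_mask(n, period=256, bandwidth=32):
    # Keep bands around diagonals spaced `period` apart
    # (e.g. attention between tokens at the same position in different frames).
    idx = np.arange(n)
    offset = (idx[:, None] - idx[None, :]) % period
    return np.minimum(offset, period - offset) <= bandwidth

def vertical_stripe_mask(n, stripe_stride=64):
    # Keep a fixed set of key columns that every query attends to.
    mask = np.zeros((n, n), dtype=bool)
    mask[:, ::stripe_stride] = True
    return mask

def choose_pattern(attn_map, tol=0.05):
    # Pick the sparsest candidate pattern that retains at least (1 - tol)
    # of the dense attention mass; fall back to dense otherwise.
    n = attn_map.shape[0]
    candidates = {
        "diagonal": diagonal_mask(n),
        "multi_diagonal": multi_diagonal_mask(n),
        "vertical_stripe": vertical_stripe_mask(n),
    }
    best_name, best_density = "dense", 1.0
    for name, mask in candidates.items():
        kept_mass = (attn_map * mask).sum() / attn_map.sum()  # attention mass retained
        density = mask.mean()                                  # fraction of entries computed
        if (1.0 - kept_mass) <= tol and density < best_density:
            best_name, best_density = name, density
    return best_name, best_density

if __name__ == "__main__":
    n = 1024
    # Synthetic attention map with diagonal decay, for demonstration only.
    dist = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    attn = np.exp(-dist / 16.0)
    attn /= attn.sum(axis=-1, keepdims=True)
    print(choose_pattern(attn))  # on this synthetic map: ('diagonal', ~0.12)
```

In the paper's actual pipeline, this per-head choice is made offline with a hardware-aware cost model rather than raw mask density, and heads within the same layer that end up with the same pattern are fused so they can share a single sparse kernel call at inference time.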