Sparse-vDiT: Unleashing the Power of Sparse Attention to Accelerate Video Diffusion Transformers
June 3, 2025
Authors: Pengtao Chen, Xianfang Zeng, Maosen Zhao, Peng Ye, Mingzhu Shen, Wei Cheng, Gang Yu, Tao Chen
cs.AI
Abstract
While Diffusion Transformers (DiTs) have achieved breakthroughs in video
generation, this long sequence generation task remains constrained by the
quadratic complexity of attention mechanisms, resulting in significant
inference latency. Through a detailed analysis of attention maps in the Video
Diffusion Transformer (vDiT), we identify three recurring sparsity patterns:
diagonal, multi-diagonal, and vertical-stripe structures. Moreover, 3-6% of
attention heads can be skipped entirely. Crucially, these patterns exhibit strong
layer-depth and head-position correlations but show limited dependence on the
input content. Leveraging these findings, we propose Sparse-vDiT, a sparsity
acceleration framework for vDiT comprising: 1) pattern-optimized sparse kernels
that replace dense attention with computationally efficient implementations for
each identified sparsity pattern; and 2) an offline sparse diffusion search
algorithm that selects the optimal sparse computation strategy for each layer and
head via hardware-aware cost modeling. After determining the optimal
configuration, we fuse heads within the same layer that share the same
attention strategy, enhancing inference efficiency. Integrated into
state-of-the-art vDiT models (CogVideoX1.5, HunyuanVideo, and Wan2.1),
Sparse-vDiT achieves 2.09×, 2.38×, and 1.67× theoretical
FLOP reduction, and actual inference speedups of 1.76×, 1.85×,
and 1.58×, respectively, while maintaining high visual fidelity, with
PSNR values reaching 24.13, 27.09, and 22.59. Our work demonstrates that latent
structural sparsity in vDiTs can be systematically exploited for long video
synthesis.
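
To make the three reported sparsity patterns concrete, here is a minimal Python/NumPy sketch (not the authors' code) that builds each pattern as a boolean attention mask; `seq_len`, `band`, `stride`, and the stripe column set are illustrative parameters that the abstract does not specify.

```python
# Illustrative masks for the three sparsity patterns observed in vDiT
# attention maps; all parameter values are assumptions for this sketch.
import numpy as np

def diagonal_mask(seq_len: int, band: int) -> np.ndarray:
    """Keep entries within `band` positions of the main diagonal."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= band

def multi_diagonal_mask(seq_len: int, band: int, stride: int) -> np.ndarray:
    """Keep bands around diagonals spaced `stride` apart, e.g. attention
    between corresponding positions of different frames."""
    idx = np.arange(seq_len)
    dist = np.abs(idx[:, None] - idx[None, :]) % stride
    return (dist <= band) | (dist >= stride - band)

def vertical_stripe_mask(seq_len: int, stripe_cols: np.ndarray) -> np.ndarray:
    """Keep a small set of key columns that every query attends to."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    mask[:, stripe_cols] = True
    return mask
```

The offline sparse diffusion search can then be read as a per-(layer, head) selection problem. The sketch below assumes a mean-absolute-error fidelity metric, a latency table profiled per pattern on the target GPU, and toy mask parameters derived from the sequence length; the paper's actual objective, metric, and thresholds may differ, and its head-fusion step is omitted here.

```python
PATTERNS = ("diagonal", "multi_diagonal", "vertical_stripe", "skip")

def apply_pattern(attn: np.ndarray, pattern: str) -> np.ndarray:
    """Approximate a dense attention map with one sparse pattern; 'skip'
    drops the head entirely. Mask parameters here are toy choices."""
    n = attn.shape[0]
    if pattern == "skip":
        return np.zeros_like(attn)
    masks = {
        "diagonal": diagonal_mask(n, band=max(1, n // 64)),
        "multi_diagonal": multi_diagonal_mask(n, band=max(1, n // 64),
                                              stride=max(2, n // 8)),
        "vertical_stripe": vertical_stripe_mask(
            n, np.arange(0, n, max(1, n // 16))),
    }
    return attn * masks[pattern]

def search_configuration(calib_maps, latency, max_err):
    """calib_maps[layer][head]: dense attention map averaged over
    calibration prompts. latency[p]: per-head kernel time profiled on the
    target GPU (the hardware-aware cost model), including a 'dense' entry.
    Returns a pattern per (layer, head); dense attention is the fallback."""
    config = {}
    for layer, heads in enumerate(calib_maps):
        for head, dense in enumerate(heads):
            best, best_cost = "dense", latency["dense"]
            for p in PATTERNS:
                err = np.abs(apply_pattern(dense, p) - dense).mean()
                # Accept a sparse pattern only if it is both faster and
                # faithful enough to the dense map on calibration data.
                if err <= max_err and latency[p] < best_cost:
                    best, best_cost = p, latency[p]
            config[(layer, head)] = best
    return config
```

Because the latency table is measured on the deployment GPU rather than estimated from FLOPs alone, the selection is hardware-aware; after the search, heads in the same layer that ended up with the same pattern can be batched into one fused kernel call, which corresponds to the fusion step the abstract describes.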