Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation
May 24, 2025
Authors: Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Chenfeng Xu, Kelly Peng, Jianfei Chen, Song Han, Kurt Keutzer, Ion Stoica
cs.AI
Abstract
Diffusion Transformers (DiTs) are essential for video generation but suffer
from significant latency due to the quadratic complexity of attention. By
computing only critical tokens, sparse attention reduces computational costs
and offers a promising acceleration approach. However, we identify that
existing methods fail to approach optimal generation quality under the same
computation budget for two reasons: (1) Inaccurate critical token
identification: current methods cluster tokens based on position rather than
semantics, leading to imprecise aggregated representations. (2) Excessive
computation waste: critical tokens are scattered among non-critical ones,
leading to wasted computation on GPUs, which are optimized for processing
contiguous tokens. In this paper, we propose SVG2, a training-free framework
that maximizes identification accuracy and minimizes computation waste,
achieving a Pareto frontier trade-off between generation quality and
efficiency. The core of SVG2 is semantic-aware permutation, which clusters and
reorders tokens based on semantic similarity using k-means. This approach
ensures both a precise cluster representation, improving identification
accuracy, and a densified layout of critical tokens, enabling efficient
computation without padding. Additionally, SVG2 integrates top-p dynamic budget
control and customized kernel implementations, achieving up to 2.30x and 1.89x
speedup while maintaining a PSNR of up to 30 and 26 on HunyuanVideo and Wan
2.1, respectively.
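The pipeline the abstract describes — cluster key tokens by semantic similarity with k-means, score clusters against the query via their centroids, keep clusters under a top-p budget, and gather the kept tokens into a contiguous (densified) layout — can be sketched in NumPy. This is a minimal illustrative sketch, not the paper's implementation: the function name, the plain k-means loop, the centroid-dot-product scoring, and the single-query setting are all assumptions for clarity.

```python
import numpy as np

def semantic_permutation_sketch(q, k, top_p=0.9, n_clusters=8, n_iters=10, seed=0):
    """Illustrative sketch of semantic-aware sparse key selection.

    q: (d,) a single query vector; k: (n, d) key token vectors.
    Returns indices of kept key tokens, grouped contiguously by cluster.
    NOTE: toy code, not the SVG2 kernel implementation.
    """
    rng = np.random.default_rng(seed)
    n, d = k.shape
    # --- k-means over key tokens: cluster by semantics, not position ---
    centroids = k[rng.choice(n, n_clusters, replace=False)].copy()
    for _ in range(n_iters):
        dists = ((k[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)  # (n, C)
        assign = dists.argmin(1)
        for c in range(n_clusters):
            members = k[assign == c]
            if len(members):
                centroids[c] = members.mean(0)
    # --- score each cluster by the query's attention to its centroid ---
    logits = centroids @ q / np.sqrt(d)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # --- top-p (nucleus-style) dynamic budget: keep the fewest clusters
    #     whose probability mass reaches top_p ---
    order = np.argsort(-probs)
    csum = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(csum, top_p)) + 1]
    # --- permutation: gather members cluster-by-cluster, so critical
    #     tokens land in one dense block (no padding needed) ---
    kept_idx = np.concatenate([np.where(assign == c)[0] for c in keep])
    return kept_idx
```

Because the kept tokens are emitted cluster-by-cluster, a downstream attention kernel can process them as contiguous blocks, which is the layout GPUs handle efficiently; the top-p threshold adapts the compute budget per query instead of fixing a token count.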