Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation
May 24, 2025
Authors: Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Chenfeng Xu, Kelly Peng, Jianfei Chen, Song Han, Kurt Keutzer, Ion Stoica
cs.AI
Abstract
Diffusion Transformers (DiTs) are essential for video generation but suffer
from significant latency due to the quadratic complexity of attention. By
computing only critical tokens, sparse attention reduces computational costs
and offers a promising acceleration approach. However, we identify that
existing methods fail to approach optimal generation quality under the same
computation budget for two reasons: (1) Inaccurate critical token
identification: current methods cluster tokens based on position rather than
semantics, leading to imprecise aggregated representations. (2) Excessive
computation waste: critical tokens are scattered among non-critical ones,
leading to wasted computation on GPUs, which are optimized for processing
contiguous tokens. In this paper, we propose SVG2, a training-free framework
that maximizes identification accuracy and minimizes computation waste,
achieving a Pareto frontier trade-off between generation quality and
efficiency. The core of SVG2 is semantic-aware permutation, which clusters and
reorders tokens based on semantic similarity using k-means. This approach
ensures both a precise cluster representation, improving identification
accuracy, and a densified layout of critical tokens, enabling efficient
computation without padding. Additionally, SVG2 integrates top-p dynamic budget
control and customized kernel implementations, achieving up to 2.30x and 1.89x
speedup while maintaining a PSNR of up to 30 and 26 on HunyuanVideo and Wan
2.1, respectively.
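Below is a minimal PyTorch sketch of the semantic-aware permutation idea described in the abstract: key tokens are clustered with k-means, clusters are scored through their centroids as aggregated representations, and only the clusters whose cumulative attention mass reaches a top-p budget are computed exactly. The function and parameter names (`kmeans`, `semantic_sparse_attention`, `num_clusters`, `top_p`) and the single-head, unbatched setup are illustrative assumptions, not the official SVG2 implementation; the paper's padding-free block layout and customized kernels are omitted.

```python
# A minimal sketch of semantic-aware sparse attention, assuming single-head,
# unbatched tensors and illustrative hyperparameters. Not the official SVG2 code.
import torch

def kmeans(x, num_clusters, iters=10):
    """Plain k-means over token embeddings; returns cluster labels and centroids."""
    n = x.shape[0]
    centroids = x[torch.randperm(n)[:num_clusters]].clone()
    for _ in range(iters):
        labels = torch.cdist(x, centroids).argmin(dim=1)
        for c in range(num_clusters):
            mask = labels == c
            if mask.any():
                centroids[c] = x[mask].mean(dim=0)
    return labels, centroids

def semantic_sparse_attention(q, k, v, num_clusters=8, top_p=0.9, scale=None):
    """Cluster keys by semantic similarity, estimate per-cluster importance from
    centroid-level attention, and compute exact attention only over the clusters
    whose cumulative probability mass reaches top_p (dynamic budget)."""
    scale = scale or q.shape[-1] ** -0.5
    labels, centroids = kmeans(k, num_clusters)

    # Cluster-level scores: queries attend to key centroids (aggregated representations).
    cluster_scores = torch.softmax((q @ centroids.T) * scale, dim=-1)  # [Lq, C]
    mean_scores = cluster_scores.mean(dim=0)                           # [C]

    # Top-p dynamic budget: keep the smallest set of clusters covering top_p mass.
    sorted_scores, order = mean_scores.sort(descending=True)
    keep = torch.cumsum(sorted_scores, dim=0) <= top_p
    keep[0] = True  # always keep the highest-scoring cluster
    critical_clusters = order[keep]

    # Permutation step: gathering the selected tokens produces a contiguous,
    # densified block, so dense GPU kernels run on it without padding.
    token_mask = torch.isin(labels, critical_clusters)
    k_sel, v_sel = k[token_mask], v[token_mask]

    attn = torch.softmax((q @ k_sel.T) * scale, dim=-1)
    return attn @ v_sel

# Usage: 1,024 query/key/value tokens of dimension 64.
q = torch.randn(1024, 64)
k = torch.randn(1024, 64)
v = torch.randn(1024, 64)
out = semantic_sparse_attention(q, k, v)
print(out.shape)  # torch.Size([1024, 64])
```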