Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation
May 24, 2025
Authors: Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Chenfeng Xu, Kelly Peng, Jianfei Chen, Song Han, Kurt Keutzer, Ion Stoica
cs.AI
Abstract
Diffusion Transformers (DiTs) are essential for video generation but suffer
from significant latency due to the quadratic complexity of attention. By
computing only critical tokens, sparse attention reduces computational costs
and offers a promising acceleration approach. However, we identify that
existing methods fail to approach optimal generation quality under the same
computation budget for two reasons: (1) Inaccurate critical token
identification: current methods cluster tokens based on position rather than
semantics, leading to imprecise aggregated representations. (2) Excessive
computation waste: critical tokens are scattered among non-critical ones,
leading to wasted computation on GPUs, which are optimized for processing
contiguous tokens. In this paper, we propose SVG2, a training-free framework
that maximizes identification accuracy and minimizes computation waste,
achieving a Pareto frontier trade-off between generation quality and
efficiency. The core of SVG2 is semantic-aware permutation, which clusters and
reorders tokens based on semantic similarity using k-means. This approach
ensures both a precise cluster representation, improving identification
accuracy, and a densified layout of critical tokens, enabling efficient
computation without padding. Additionally, SVG2 integrates top-p dynamic budget
control and customized kernel implementations, achieving up to 2.30x and 1.89x
speedup while maintaining a PSNR of up to 30 and 26 on HunyuanVideo and Wan
2.1, respectively.
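Below is a minimal PyTorch sketch of the semantic-aware permutation idea described in the abstract: key tokens are clustered with k-means, clusters are scored through their centroids as aggregated representations, and only the clusters whose cumulative attention mass reaches a top-p budget are computed exactly. The function and parameter names (`kmeans`, `semantic_sparse_attention`, `num_clusters`, `top_p`) and the single-head, unbatched setup are illustrative assumptions, not the official SVG2 implementation; the paper's padding-free block layout and customized kernels are omitted.

```python
# A minimal sketch of semantic-aware sparse attention, assuming single-head,
# unbatched tensors and illustrative hyperparameters. Not the official SVG2 code.
import torch

def kmeans(x, num_clusters, iters=10):
    """Plain k-means over token embeddings; returns cluster labels and centroids."""
    n = x.shape[0]
    centroids = x[torch.randperm(n)[:num_clusters]].clone()
    for _ in range(iters):
        labels = torch.cdist(x, centroids).argmin(dim=1)
        for c in range(num_clusters):
            mask = labels == c
            if mask.any():
                centroids[c] = x[mask].mean(dim=0)
    return labels, centroids

def semantic_sparse_attention(q, k, v, num_clusters=8, top_p=0.9, scale=None):
    """Cluster keys by semantic similarity, estimate per-cluster importance from
    centroid-level attention, and compute exact attention only over the clusters
    whose cumulative probability mass reaches top_p (dynamic budget)."""
    scale = scale or q.shape[-1] ** -0.5
    labels, centroids = kmeans(k, num_clusters)

    # Cluster-level scores: queries attend to key centroids (aggregated representations).
    cluster_scores = torch.softmax((q @ centroids.T) * scale, dim=-1)  # [Lq, C]
    mean_scores = cluster_scores.mean(dim=0)                           # [C]

    # Top-p dynamic budget: keep the smallest set of clusters covering top_p mass.
    sorted_scores, order = mean_scores.sort(descending=True)
    keep = torch.cumsum(sorted_scores, dim=0) <= top_p
    keep[0] = True  # always keep the highest-scoring cluster
    critical_clusters = order[keep]

    # Permutation step: gathering the selected tokens produces a contiguous,
    # densified block, so dense GPU kernels run on it without padding.
    token_mask = torch.isin(labels, critical_clusters)
    k_sel, v_sel = k[token_mask], v[token_mask]

    attn = torch.softmax((q @ k_sel.T) * scale, dim=-1)
    return attn @ v_sel

# Usage: 1,024 query/key/value tokens of dimension 64.
q = torch.randn(1024, 64)
k = torch.randn(1024, 64)
v = torch.randn(1024, 64)
out = semantic_sparse_attention(q, k, v)
print(out.shape)  # torch.Size([1024, 64])
```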