Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation

Abstract

Diffusion Transformers (DiTs) are essential for video generation but suffer from significant latency due to the quadratic complexity of attention. Sparse attention, which computes only the critical tokens, reduces computational cost and offers a promising route to acceleration. However, we identify two reasons why existing methods fall short of the optimal generation quality under the same computation budget: (1) Inaccurate critical token identification: current methods cluster tokens based on position rather than semantics, leading to imprecise aggregated representations. (2) Excessive computation waste: critical tokens are scattered among non-critical ones, which wastes computation on GPUs, since GPUs are optimized for processing contiguous tokens. In this paper, we propose SVG2, a training-free framework that maximizes identification accuracy and minimizes computation waste, achieving a Pareto-frontier trade-off between generation quality and efficiency. The core of SVG2 is semantic-aware permutation, which clusters and reorders tokens based on semantic similarity using k-means. This approach ensures both a precise cluster representation, which improves identification accuracy, and a densified layout of critical tokens, which enables efficient computation without padding. Additionally, SVG2 integrates top-p dynamic budget control and customized kernel implementations, achieving up to 2.30x and 1.89x speedup while maintaining a PSNR of up to 30 and 26 on HunyuanVideo and Wan 2.1, respectively.
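
The two ideas named in the abstract, semantic-aware permutation and top-p budget control, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`kmeans`, `semantic_permutation`, `top_p_clusters`), the size-weighted softmax over centroid scores, and all shapes are assumptions made for illustration, and the customized GPU kernels are not modeled at all.

```python
import numpy as np


def kmeans(x, k, iters=10, seed=0):
    """Plain Lloyd's k-means on token vectors x of shape (n, d)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every token to every centroid.
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        for c in range(k):
            members = x[labels == c]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[c] = members.mean(axis=0)
    return labels, centroids


def semantic_permutation(labels):
    """Token order in which members of the same cluster are contiguous."""
    return np.argsort(labels, kind="stable")


def top_p_clusters(q_centroid, k_centroids, k_sizes, p=0.9):
    """Smallest set of key clusters whose estimated attention mass reaches p.

    Assumption: a cluster's mass is approximated by a softmax over
    centroid dot products, weighted by the number of tokens in the cluster.
    """
    scores = k_centroids @ q_centroid / np.sqrt(q_centroid.shape[0])
    weights = k_sizes * np.exp(scores - scores.max())
    prob = weights / weights.sum()
    order = np.argsort(-prob)  # most important cluster first
    cum = np.cumsum(prob[order])
    n_keep = min(int(np.searchsorted(cum, p)) + 1, len(order))
    return order[:n_keep]


# Toy usage: 64 key tokens of dimension 8, grouped into 4 clusters.
rng = np.random.default_rng(0)
keys = rng.normal(size=(64, 8))
labels, centroids = kmeans(keys, k=4)
perm = semantic_permutation(labels)
keys_permuted = keys[perm]  # each cluster now occupies one contiguous block
sizes = np.bincount(labels, minlength=4)
kept = top_p_clusters(keys[0], centroids, sizes, p=0.9)
```

In this sketch, the number of clusters kept per query varies with how concentrated the estimated attention mass is, which is the sense in which a top-p budget is dynamic rather than a fixed top-k. After permutation the kept clusters form contiguous blocks, so a block-sparse kernel would not need to pad around scattered tokens.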
