LVSA: Training-Free Sparse Attention for Long Video Diffusion

Abstract

Dense self-attention is both the compute bottleneck and the quality bottleneck of long-video diffusion inference: its cost grows quadratically with sequence length, and beyond the training horizon the model converges to near-static output, i.e., "frozen", repetitive video. State-of-the-art approaches are either too costly (for example, they require retraining) or fail to meet both performance and quality objectives in a scalable manner. To address this, we introduce Long Video Sparse Attention (LVSA), a training-free, model-agnostic block-sparse attention for video diffusion transformers. LVSA combines a structured window pattern with rotating global anchors, thereby removing the fixed-grid bias that causes long-range temporal artifacts. Combined with a FlashInfer kernel, LVSA reduces compute relative to dense attention by up to 3.17x on Wan 2.1 1.3B at a 6x horizon, 2.98x on Wan 2.1 14B at a 6x horizon, and 3.33x on HunyuanVideo 1.5 at a 1.5x horizon. Beyond reducing compute, LVSA enables HunyuanVideo 1.5 generation at a 2x horizon, which otherwise runs out of memory on a single GPU. Moreover, LVSA provides speedups of up to 2.41x over RIFLEx and 3.27x over UltraViCo on Wan 2.1 1.3B. To demonstrate applicability across diverse platforms, we apply LVSA on NPUs and achieve speedups over dense attention of up to 2.71x on Wan 2.2 A14B and 3.24x on Wan 2.1 1.3B. To evaluate quality fairly, we introduce VQeval, a tool that properly scores looping-video failures, which state-of-the-art evaluators such as VBench-Long instead reward. LVSA is quality-neutral for generation at the training horizon and quality-positive at extended lengths.
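To make the idea of a block-sparse pattern with a window and rotating anchors concrete, the following is a minimal illustrative sketch, not the LVSA implementation. The abstract does not specify the window shape, the anchor spacing, or what the anchors rotate over, so the function name `block_mask` and the parameters `window`, `stride`, and `step` are all hypothetical. The sketch assumes a 1D sequence of attention blocks, a band of `window` blocks on each side of the diagonal, and anchor columns every `stride` blocks whose offset shifts with a `step` index.

```python
import numpy as np


def block_mask(num_blocks: int, window: int, stride: int, step: int) -> np.ndarray:
    """Return a boolean [num_blocks, num_blocks] block mask.

    True means the query block (row) attends to the key block (column).
    All parameters are illustrative assumptions, not values from the paper.
    """
    q = np.arange(num_blocks)[:, None]  # query block indices, shape [n, 1]
    k = np.arange(num_blocks)[None, :]  # key block indices, shape [1, n]

    # Structured window: each query block sees nearby key blocks.
    local = np.abs(q - k) <= window

    # Global anchors: every `stride`-th key block is visible to all queries.
    # The offset depends on `step`, so the anchor grid is not fixed.
    anchors = (k % stride) == (step % stride)

    return local | anchors


def density(mask: np.ndarray) -> float:
    """Fraction of block pairs that are actually computed."""
    return float(mask.mean())


if __name__ == "__main__":
    for step in range(4):
        m = block_mask(num_blocks=8, window=1, stride=4, step=step)
        print(f"step={step} density={density(m):.3f}")
```

In this sketch the fraction of computed block pairs stays well below 1, while across `stride` consecutive values of `step` every key block serves as an anchor at least once. That is one plausible reading of how rotating anchors could avoid a fixed-grid bias.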