LVSA: Training-Free Sparse Attention for Long Video Diffusion

Abstract

Dense self-attention is the compute and quality bottleneck of long-video diffusion inference: its cost grows quadratically with sequence length, and beyond the training horizon the model converges to near-static output, that is, "frozen" repetitive video. State-of-the-art approaches are either too costly (for example, they require retraining) or fail to meet both performance and quality objectives in a scalable manner. To address this, we introduce Long Video Sparse Attention (LVSA), a training-free, model-agnostic block-sparse attention for video diffusion transformers that combines a structured window pattern with rotating global anchors, thereby removing the fixed-grid bias that causes long-range temporal artifacts. Combined with a FlashInfer kernel, LVSA reduces compute relative to dense attention by up to 3.17x on Wan 2.1 1.3B at a 6x horizon, 2.98x on Wan 2.1 14B at a 6x horizon, and 3.33x on HunyuanVideo 1.5 at a 1.5x horizon. Beyond reducing compute, LVSA enables HunyuanVideo 1.5 generation at a 2x horizon, which otherwise runs out of memory on a single GPU. Moreover, LVSA provides speedups of up to 2.41x over RIFLEx and 3.27x over UltraViCo on Wan 2.1 1.3B. To demonstrate applicability across diverse platforms, we also apply LVSA on NPUs and achieve speedups over dense attention of up to 2.71x on Wan 2.2 A14B and 3.24x on Wan 2.1 1.3B. To evaluate quality fairly, we introduce VQeval, a tool that properly scores looping-video failures, which state-of-the-art evaluators such as VBench-Long instead reward. LVSA is quality-neutral at the training horizon length and quality-positive at extended lengths.
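To make the two ideas in the abstract concrete, the following Python sketch builds a toy block-level attention mask from a local window plus a strided set of global anchor blocks whose positions shift. It is an illustration only and is not taken from the LVSA implementation: the names `block_mask`, `window`, `anchor_stride`, and `shift` are hypothetical, and the abstract does not specify how the anchors rotate or what the window pattern looks like. The helper `dense_cost_ratio` simply restates the quadratic scaling of dense attention.

```python
def block_mask(num_blocks, window, anchor_stride, shift):
    """Return a num_blocks x num_blocks boolean block mask (toy example).

    Query block q may attend key block k when either
      * |q - k| <= window (local window), or
      * k is a global anchor: (k - shift) % anchor_stride == 0.
    Changing `shift` (for example per layer or per denoising step) is one
    hypothetical way to rotate the anchors; the paper's rule may differ.
    """
    mask = []
    for q in range(num_blocks):
        row = []
        for k in range(num_blocks):
            local = abs(q - k) <= window
            anchor = (k - shift) % anchor_stride == 0
            row.append(local or anchor)
        mask.append(row)
    return mask


def density(mask):
    """Fraction of block pairs that are computed (1.0 means dense)."""
    total = len(mask) * len(mask[0])
    kept = sum(sum(row) for row in mask)
    return kept / total


def dense_cost_ratio(horizon_scale):
    """Dense attention cost relative to the training horizon (quadratic)."""
    return horizon_scale ** 2


if __name__ == "__main__":
    for shift in range(4):
        m = block_mask(num_blocks=8, window=1, anchor_stride=4, shift=shift)
        print(shift, round(density(m), 3))
    print(dense_cost_ratio(6))
```

With a fixed `shift`, the same key blocks act as anchors for every query, which is a simple picture of a fixed grid. Cycling `shift` through `0 .. anchor_stride - 1` lets every key block serve as an anchor once per cycle, while each individual mask stays sparse. The quadratic helper shows why sparsity matters at long horizons: under dense attention, a 6x longer sequence costs 36x as much.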