LVSA: Training-Free Sparse Attention for Long-Video Diffusion

Abstract

Dense self-attention is both the compute bottleneck and the quality bottleneck of long-video diffusion inference: its cost grows quadratically with sequence length, and beyond the training horizon the model converges to near-static output, that is, "frozen", repetitive video. State-of-the-art approaches are either too costly (for example, they require retraining) or fail to meet performance and quality objectives together in a scalable way. To address this, we introduce Long Video Sparse Attention (LVSA), a training-free, model-agnostic block-sparse attention mechanism for video diffusion transformers. LVSA combines a structured window pattern with rotating global anchors, removing the fixed-grid bias that causes long-range temporal artifacts. Combined with a FlashInfer kernel, LVSA reduces compute relative to dense attention by up to 3.17x on Wan 2.1 1.3B at a 6x horizon, 2.98x on Wan 2.1 14B at a 6x horizon, and 3.33x on HunyuanVideo 1.5 at a 1.5x horizon. Beyond reducing compute, LVSA enables HunyuanVideo 1.5 generation at a 2x horizon, which otherwise runs out of memory on a single GPU. On Wan 2.1 1.3B, LVSA is also up to 2.41x faster than RIFLEx and up to 3.27x faster than UltraViCo. To demonstrate applicability across diverse platforms, we apply LVSA on neural processing units (NPUs) and achieve speedups over dense attention of up to 2.71x on Wan 2.2 A14B and 3.24x on Wan 2.1 1.3B. To evaluate quality fairly, we introduce VQeval, a tool that correctly scores looping-video failures, which current state-of-the-art evaluators such as VBench-Long instead reward. LVSA is quality-neutral for generation at the training horizon and quality-positive at extended horizons.
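To make the block-sparse idea concrete, the following is a minimal illustrative sketch, not the LVSA implementation. It builds a boolean block mask from a local window plus a few global anchor blocks whose positions shift with a step index. The function name `block_mask`, its parameters, the even anchor spacing, and the choice to rotate anchors once per denoising step are all assumptions made for illustration; the abstract does not specify them. For scale, dense attention over a sequence 6x longer computes 36x as many query-key scores, which is the cost a sparse mask of this kind reduces.

```python
import numpy as np


def block_mask(num_blocks: int, window: int, num_anchors: int, step: int) -> np.ndarray:
    """Return a [num_blocks, num_blocks] boolean mask (True = block pair is computed).

    Hypothetical construction: a local window plus rotating global anchors.
    """
    idx = np.arange(num_blocks)
    # Local window: each query block attends to key blocks at most `window` blocks away.
    mask = np.abs(idx[:, None] - idx[None, :]) <= window
    # Global anchors: evenly spaced key blocks that every query block attends to.
    # Assumption: the anchor offset advances by one block per step, so the
    # anchors do not stay on a fixed grid across steps.
    stride = max(num_blocks // num_anchors, 1)
    anchors = (np.arange(num_anchors) * stride + step) % num_blocks
    mask[:, anchors] = True
    return mask


def density(mask: np.ndarray) -> float:
    """Fraction of block pairs computed; dense attention has density 1.0."""
    return float(mask.mean())


if __name__ == "__main__":
    for step in range(3):
        m = block_mask(num_blocks=12, window=1, num_anchors=3, step=step)
        print(f"step {step}: density = {density(m):.3f}")
```

The reciprocal of the density gives only a rough upper bound on the attention speedup for a mask like this; measured gains also depend on the kernel and on the non-attention parts of the model.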