LVSA: Training-Free Sparse Attention for Long-Video Diffusion

Abstract

Dense self-attention is both the compute bottleneck and the quality bottleneck of long-video diffusion inference: its cost grows quadratically with sequence length, and beyond the training horizon the model converges to near-static output, that is, "frozen", repetitive video. State-of-the-art approaches are either too costly (e.g., they require retraining) or fail to meet both performance and quality objectives in a scalable manner. To address this, we introduce Long Video Sparse Attention (LVSA), a training-free, model-agnostic block-sparse attention mechanism for video diffusion transformers. LVSA combines a structured window pattern with rotating global anchors, removing the fixed-grid bias that causes long-range temporal artifacts. Combined with a FlashInfer kernel, LVSA reduces compute relative to dense attention by up to 3.17x on Wan 2.1 1.3B at a 6x horizon, 2.98x on Wan 2.1 14B at a 6x horizon, and 3.33x on HunyuanVideo 1.5 at a 1.5x horizon. Beyond reducing compute, LVSA enables HunyuanVideo 1.5 generation at a 2x horizon, which otherwise runs out of memory on a single GPU. Moreover, LVSA is up to 2.41x faster than RIFLEx and up to 3.27x faster than UltraViCo on Wan 2.1 1.3B. To demonstrate applicability across platforms, we also apply LVSA on NPUs, achieving speedups over dense attention of up to 2.71x on Wan 2.2 A14B and 3.24x on Wan 2.1 1.3B. To evaluate quality fairly, we introduce VQeval, a tool that correctly scores looping-video failures, which state-of-the-art evaluators such as VBench-Long instead reward. LVSA is quality-neutral when generating at the training horizon and quality-positive at extended horizons.
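
To make the idea of a structured window combined with rotating global anchors more concrete, the following is a minimal illustrative sketch of a block-level attention mask. It is not the LVSA implementation: the abstract does not specify the pattern, so the function name `block_mask`, the parameters `window`, `anchor_stride`, and `rotation`, and the rule that anchors are evenly spaced key blocks whose offset shifts by one position per rotation step are all assumptions made for illustration.

```python
import numpy as np


def block_mask(num_blocks: int, window: int, anchor_stride: int, rotation: int) -> np.ndarray:
    """Return a boolean [num_blocks, num_blocks] mask (True = query block attends to key block).

    Hypothetical pattern, assumed for illustration only:
      * a local window: each query block sees key blocks within `window` blocks of itself;
      * global anchors: every query block also sees key blocks whose index is congruent
        to `rotation` modulo `anchor_stride`; changing `rotation` moves the anchors.
    """
    q = np.arange(num_blocks)[:, None]  # query block indices, shape (n, 1)
    k = np.arange(num_blocks)[None, :]  # key block indices, shape (1, n)
    local = np.abs(q - k) <= window     # banded window around the diagonal
    anchors = (k % anchor_stride) == (rotation % anchor_stride)  # shape (1, n)
    return local | anchors              # anchors broadcast across all query rows


def density(mask: np.ndarray) -> float:
    """Fraction of block pairs that are computed (1.0 would be dense attention)."""
    return float(mask.mean())


if __name__ == "__main__":
    # Rotate the anchor offset across three hypothetical steps and report sparsity.
    for step in range(3):
        mask = block_mask(num_blocks=64, window=2, anchor_stride=16, rotation=step)
        print(f"rotation={step} density={density(mask):.3f}")
```

In this sketch the anchor columns move with `rotation`, so no fixed set of key blocks is privileged across calls, which is one plausible reading of how rotation could avoid a fixed-grid bias. How the actual method chooses and rotates its anchors (per denoising step, per layer, or otherwise) is not stated in the abstract.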