

VMoBA: Mixture-of-Block Attention for Video Diffusion Models

June 30, 2025
作者: Jianzong Wu, Liang Hou, Haotian Yang, Xin Tao, Ye Tian, Pengfei Wan, Di Zhang, Yunhai Tong
cs.AI

Abstract

The quadratic complexity of full attention mechanisms poses a significant bottleneck for Video Diffusion Models (VDMs) aiming to generate long-duration, high-resolution videos. While various sparse attention methods have been proposed, many are designed as training-free inference accelerators or do not optimally capture the unique spatio-temporal characteristics inherent in video data when trained natively. This paper introduces Video Mixture of Block Attention (VMoBA), a novel sparse attention mechanism specifically adapted for VDMs. Motivated by an in-depth analysis of attention patterns within pre-trained video transformers, which revealed strong spatio-temporal locality, varying query importance, and head-specific concentration levels, VMoBA enhances the original MoBA framework with three key modifications: (1) a layer-wise recurrent block partition scheme (1D-2D-3D) to dynamically adapt to diverse spatio-temporal attention patterns and improve efficiency; (2) global block selection to prioritize the most salient query-key block interactions across an entire attention head; and (3) threshold-based block selection to dynamically determine the number of attended blocks based on their cumulative similarity. Extensive experiments demonstrate that VMoBA significantly accelerates the training of VDMs on longer sequences, achieving 2.92x FLOPs and 1.48x latency speedup, while attaining comparable or even superior generation quality to full attention. Furthermore, VMoBA exhibits competitive performance in training-free inference, offering 2.40x FLOPs and 1.35x latency speedup for high-resolution video generation.
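To make modifications (2) and (3) concrete, here is a minimal sketch of global, threshold-based block selection for a single attention head. It is an illustrative reconstruction based only on the abstract, not the authors' implementation: the function name, mean-pooled block summaries, and the exact thresholding rule are all assumptions. The idea it demonstrates is MoBA-style block scoring, ranked jointly across the whole head (global selection) and truncated once the blocks' cumulative similarity mass reaches a threshold tau.

```python
import torch

def select_blocks_by_threshold(q, k, block_size, tau=0.9):
    """Hypothetical sketch of VMoBA-style global, threshold-based block
    selection for one attention head. Names and details are assumptions.

    q, k: (S, d) queries and keys, with S divisible by block_size.
    Returns a boolean mask (num_blocks, num_blocks) marking which
    query-block / key-block pairs participate in attention.
    """
    S, d = q.shape
    nb = S // block_size
    # Summarize each block by mean-pooling its queries / keys (MoBA-style).
    qb = q.view(nb, block_size, d).mean(dim=1)   # (nb, d)
    kb = k.view(nb, block_size, d).mean(dim=1)   # (nb, d)
    # Block-level similarity score for every query-block / key-block pair.
    scores = (qb @ kb.T) / d ** 0.5              # (nb, nb)
    # Global selection: rank all pairs in the head jointly, then keep the
    # top pairs until their cumulative softmax mass reaches tau.
    probs = torch.softmax(scores.flatten(), dim=0)
    order = torch.argsort(probs, descending=True)
    csum = torch.cumsum(probs[order], dim=0)
    keep = order[: int((csum < tau).sum()) + 1]  # always keep >= 1 pair
    mask = torch.zeros(nb * nb, dtype=torch.bool)
    mask[keep] = True
    return mask.view(nb, nb)
```

Because the cutoff depends on the score distribution rather than a fixed top-k, heads with concentrated attention keep few blocks while diffuse heads keep many, which matches the head-specific concentration levels the paper reports.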