VMoBA: Mixture-of-Block Attention for Video Diffusion Models
June 30, 2025
Authors: Jianzong Wu, Liang Hou, Haotian Yang, Xin Tao, Ye Tian, Pengfei Wan, Di Zhang, Yunhai Tong
cs.AI
Abstract
The quadratic complexity of full attention mechanisms poses a significant
bottleneck for Video Diffusion Models (VDMs) aiming to generate long-duration,
high-resolution videos. While various sparse attention methods have been
proposed, many are designed as training-free inference accelerators or do not
optimally capture the unique spatio-temporal characteristics inherent in video
data when trained natively. This paper introduces Video Mixture of Block
Attention (VMoBA), a novel sparse attention mechanism specifically adapted for
VDMs. Motivated by an in-depth analysis of attention patterns within
pre-trained video transformers, which revealed strong spatio-temporal locality,
varying query importance, and head-specific concentration levels, VMoBA
enhances the original MoBA framework with three key modifications: (1) a
layer-wise recurrent block partition scheme (1D-2D-3D) to dynamically adapt to
diverse spatio-temporal attention patterns and improve efficiency; (2) global
block selection to prioritize the most salient query-key block interactions
across an entire attention head; and (3) threshold-based block selection to
dynamically determine the number of attended blocks based on their cumulative
similarity. Extensive experiments demonstrate that VMoBA significantly
accelerates the training of VDMs on longer sequences, achieving 2.92x FLOPs and
1.48x latency speedup, while attaining comparable or even superior generation
quality to full attention. Furthermore, VMoBA exhibits competitive performance
in training-free inference, offering 2.40x FLOPs and 1.35x latency speedups for
high-resolution video generation.
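The threshold-based block selection described above can be illustrated with a minimal sketch: score each key block against a query, then keep the highest-scoring blocks until their cumulative (softmax-normalized) similarity mass reaches a threshold tau. The function name, the use of mean-pooled block keys as block scores, and the softmax normalization are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def select_blocks_by_threshold(q, keys_by_block, tau=0.9):
    """Hedged sketch of threshold-based block selection.

    q             : (d,) query vector
    keys_by_block : list of (n_i, d) arrays, the keys in each block
    tau           : cumulative-similarity threshold in (0, 1]

    Returns the indices of the attended blocks, so the number of blocks
    varies per query instead of being a fixed top-k.
    """
    # Score each block by the query's similarity to its mean-pooled key
    # (an assumed block-scoring choice for illustration).
    scores = np.array([q @ k_block.mean(axis=0) for k_block in keys_by_block])
    # Normalize scores to a probability mass over blocks.
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    # Rank blocks by similarity and take the smallest prefix whose
    # cumulative mass reaches tau.
    order = np.argsort(-probs)
    cum = np.cumsum(probs[order])
    n_keep = int(np.searchsorted(cum, tau) + 1)
    return sorted(order[:n_keep].tolist())
```

With a dominant block, a moderate tau keeps only that block; raising tau toward 1 admits progressively weaker blocks, which matches the abstract's point that the number of attended blocks is determined dynamically per query rather than fixed in advance.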