VMoBA: Mixture-of-Block Attention for Video Diffusion Models

Abstract

The quadratic complexity of full attention is a significant bottleneck for Video Diffusion Models (VDMs) that aim to generate long-duration, high-resolution videos. Various sparse attention methods have been proposed, but many are designed as training-free inference accelerators, or, when trained natively, do not optimally capture the spatio-temporal characteristics inherent in video data. This paper introduces Video Mixture of Block Attention (VMoBA), a novel sparse attention mechanism specifically adapted for VDMs. An in-depth analysis of attention patterns within pre-trained video transformers revealed strong spatio-temporal locality, varying query importance, and head-specific concentration levels. Motivated by these findings, VMoBA enhances the original MoBA framework with three key modifications: (1) a layer-wise recurrent block partition scheme (1D-2D-3D) that dynamically adapts to diverse spatio-temporal attention patterns and improves efficiency; (2) global block selection, which prioritizes the most salient query-key block interactions across an entire attention head; and (3) threshold-based block selection, which dynamically determines the number of attended blocks based on their cumulative similarity. Extensive experiments demonstrate that VMoBA significantly accelerates the training of VDMs on longer sequences, achieving a 2.92x speedup in FLOPs and a 1.48x speedup in latency, while attaining generation quality comparable or even superior to full attention. Furthermore, VMoBA exhibits competitive performance in training-free inference, offering a 2.40x speedup in FLOPs and a 1.35x speedup in latency for high-resolution video generation.
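
The three modifications can be illustrated with a small, hedged sketch. It is not the authors' implementation: the function names `partition_mode` and `select_blocks`, the reading of "layer-wise recurrent" as a 1D, 2D, 3D cycle over layer indices, and the use of a plain non-negative similarity table are all assumptions made for illustration. The sketch ranks every query-block pair of one head together (global selection) and keeps the top pairs until their cumulative similarity reaches a chosen fraction of the total (threshold-based selection).

```python
def partition_mode(layer_index):
    """Assumed reading of the 1D-2D-3D scheme: cycle the block
    partition type over consecutive layers (hypothetical helper)."""
    return ("1D", "2D", "3D")[layer_index % 3]


def select_blocks(scores, threshold):
    """Pick (query, key_block) pairs for one attention head.

    scores[q][b] is an assumed non-negative similarity between
    query q and key block b. All pairs in the head are ranked
    together, and the highest-scoring pairs are kept until their
    cumulative similarity reaches `threshold` times the total.
    """
    pairs = [
        (score, q, b)
        for q, row in enumerate(scores)
        for b, score in enumerate(row)
    ]
    # Rank every pair in the head jointly, highest similarity first.
    pairs.sort(key=lambda item: item[0], reverse=True)
    total = sum(score for score, _, _ in pairs)

    selected = set()
    cumulative = 0.0
    for score, q, b in pairs:
        selected.add((q, b))
        cumulative += score
        if cumulative >= threshold * total:
            break
    return selected


# Query 0 is concentrated on one block; query 1 is spread evenly.
scores = [
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],
]
selected = select_blocks(scores, threshold=0.8)
```

In this toy input, the concentrated query ends up with a single block while the diffuse query keeps all three, which is the kind of per-query variation that a fixed per-query top-k cannot express. The threshold value 0.8 is arbitrary here and is not taken from the paper.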
