∇NABLA: Neighborhood Adaptive Block-Level Attention

Abstract

Recent progress in transformer-based architectures has demonstrated remarkable success in video generation tasks. However, the quadratic complexity of full attention remains a critical bottleneck, particularly for high-resolution and long-duration video sequences. In this paper, we propose NABLA, a novel Neighborhood Adaptive Block-Level Attention mechanism that dynamically adapts to sparsity patterns in video diffusion transformers (DiTs). By combining block-wise attention with an adaptive sparsity-driven threshold, NABLA reduces computational overhead while preserving generative quality. Our method does not require custom low-level operator design and can be seamlessly integrated with PyTorch's Flex Attention operator. Experiments demonstrate that NABLA achieves up to 2.7x faster training and inference compared to the baseline, with almost no degradation in quantitative metrics (CLIP score, VBench score, human evaluation score) or visual quality. The code and model weights are available here: https://github.com/gen-ai-team/Wan2.1-NABLA
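To make the idea of block-wise attention with an adaptive threshold more concrete, the following is a minimal sketch of one possible reading of it, not the authors' implementation. It assumes that queries and keys are mean-pooled into blocks, that block-level attention probabilities are computed with a softmax, and that each query block keeps the smallest set of key blocks whose cumulative probability reaches a target mass. The function names (`mean_pool`, `adaptive_block_mask`, `density`) and the `keep_mass` parameter are hypothetical and chosen only for illustration.

```python
import math


def mean_pool(tokens, block):
    """Average consecutive groups of `block` token vectors into one vector each."""
    pooled = []
    for start in range(0, len(tokens), block):
        group = tokens[start:start + block]
        dim = len(group[0])
        pooled.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return pooled


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_block_mask(q, k, block, keep_mass):
    """Return mask[i][j] == True if query block i should attend to key block j.

    Assumption (not taken from the paper): for every query block, key blocks
    are added in order of decreasing block-level probability until their
    cumulative probability reaches `keep_mass`.
    """
    qb, kb = mean_pool(q, block), mean_pool(k, block)
    scale = 1.0 / math.sqrt(len(q[0]))
    mask = []
    for qv in qb:
        scores = [scale * sum(a * b for a, b in zip(qv, kv)) for kv in kb]
        probs = softmax(scores)
        order = sorted(range(len(probs)), key=lambda j: probs[j], reverse=True)
        row = [False] * len(kb)
        mass = 0.0
        for j in order:
            row[j] = True
            mass += probs[j]
            if mass >= keep_mass:
                break
        mask.append(row)
    return mask


def density(mask):
    """Fraction of block pairs that are kept (1.0 means full attention)."""
    cells = [c for row in mask for c in row]
    return sum(cells) / len(cells)


if __name__ == "__main__":
    # Toy input: 8 tokens of dimension 2, grouped into blocks of 2 tokens.
    q = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0],
         [1.0, 1.0], [1.0, 1.0], [-1.0, 0.0], [-1.0, 0.0]]
    mask = adaptive_block_mask(q, q, block=2, keep_mass=0.6)
    for row in mask:
        print("".join("#" if cell else "." for cell in row))
    print("density:", density(mask))
```

Because the number of kept blocks depends on how concentrated each row of block-level probabilities is, the mask adapts per query block rather than using a fixed window. In a PyTorch setting, a block mask of this kind would be handed to Flex Attention; that wiring is not shown here.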
