Fast Video Generation with Sliding Tile Attention

Abstract

Diffusion Transformers (DiTs) with 3D full attention power state-of-the-art video generation, but suffer from prohibitive compute cost: when generating just a 5-second 720P video, attention alone takes 800 of the 945 seconds of total inference time. This paper introduces sliding tile attention (STA) to address this challenge. STA leverages the observation that attention scores in pretrained video diffusion models predominantly concentrate within localized 3D windows. By sliding over and attending to the local spatio-temporal region, STA eliminates redundancy from full attention. Unlike traditional token-wise sliding window attention (SWA), STA operates tile by tile with a novel hardware-aware sliding window design, preserving expressiveness while being hardware-efficient. With careful kernel-level optimizations, STA offers the first efficient 2D/3D sliding-window-like attention implementation, achieving 58.79% MFU. Specifically, STA accelerates attention by 2.8-17x over FlashAttention-2 (FA2) and 1.6-10x over FlashAttention-3 (FA3). On the leading video DiT, HunyuanVideo, STA reduces end-to-end latency from 945s (FA3) to 685s without quality degradation, requiring no training. Enabling finetuning further lowers latency to 268s with only a 0.09% drop on VBench.
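The contrast between token-wise and tile-wise windows can be illustrated with a toy sketch. This is not the authors' kernel: the function names, the small token grid, the window sizes, and the simple boundary handling (windows are clipped at the edges rather than shifted) are all illustrative assumptions. The sketch groups tokens into tiles and classifies every (query tile, key tile) block as fully attended ("dense"), fully skipped ("empty"), or partially masked ("mixed"). A tile-granular window produces no mixed blocks, which is one plausible reading of why the tile-by-tile design is described as hardware-efficient.

```python
from itertools import product


def tokens_in_tile(tile_id, tile):
    """List the (t, h, w) token coordinates covered by one tile."""
    ranges = (range(i * s, (i + 1) * s) for i, s in zip(tile_id, tile))
    return list(product(*ranges))


def sta_allowed(q, k, tile, window_tiles):
    """Tile-wise rule (assumed): k is visible if its tile lies within the
    window, measured in tiles, centered on the tile that contains q."""
    q_tile = tuple(c // s for c, s in zip(q, tile))
    k_tile = tuple(c // s for c, s in zip(k, tile))
    return all(abs(a - b) <= w // 2 for a, b, w in zip(q_tile, k_tile, window_tiles))


def swa_allowed(q, k, window_tokens):
    """Token-wise rule: k is visible if it lies within the window,
    measured in tokens, centered on the token q itself."""
    return all(abs(a - b) <= w // 2 for a, b, w in zip(q, k, window_tokens))


def classify_blocks(grid, tile, allowed):
    """Count (query tile, key tile) blocks that are dense, empty, or mixed."""
    tiles_per_dim = [g // s for g, s in zip(grid, tile)]
    tile_ids = list(product(*(range(n) for n in tiles_per_dim)))
    tokens = {tid: tokens_in_tile(tid, tile) for tid in tile_ids}
    counts = {"dense": 0, "empty": 0, "mixed": 0}
    for q_tile in tile_ids:
        for k_tile in tile_ids:
            hits = sum(allowed(q, k) for q in tokens[q_tile] for k in tokens[k_tile])
            total = len(tokens[q_tile]) * len(tokens[k_tile])
            kind = "dense" if hits == total else "empty" if hits == 0 else "mixed"
            counts[kind] += 1
    return counts


# Hypothetical toy sizes: a 4x8x8 token grid split into 2x2x2 tiles.
GRID = (4, 8, 8)
TILE = (2, 2, 2)

# STA-style window of 3x3x3 tiles versus a token-wise window of 5x5x5 tokens.
sta_counts = classify_blocks(GRID, TILE, lambda q, k: sta_allowed(q, k, TILE, (3, 3, 3)))
swa_counts = classify_blocks(GRID, TILE, lambda q, k: swa_allowed(q, k, (5, 5, 5)))

print("STA blocks:", sta_counts)
print("SWA blocks:", swa_counts)
```

Because the tile-wise rule depends only on tile indices, every block is either computed in full or skipped outright, so a block-wise attention kernel needs no per-element masking. Under the token-wise rule, blocks near the window boundary are only partially visible and would need explicit masking inside the kernel.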
