Radial Attention: O(n log n) Sparse Attention with Energy Decay for Long Video Generation

Abstract

Recent advances in diffusion models have enabled high-quality video generation, but the additional temporal dimension significantly increases computational costs, making training and inference on long videos prohibitively expensive. In this paper, we identify a phenomenon we term Spatiotemporal Energy Decay in video diffusion models: post-softmax attention scores diminish as the spatial and temporal distance between tokens increases, akin to the physical decay of signals or waves over space and time in nature. Motivated by this, we propose Radial Attention, a scalable sparse attention mechanism with O(n log n) complexity that translates energy decay into exponentially decaying compute density. It is significantly more efficient than standard O(n^2) dense attention and more expressive than linear attention. Specifically, Radial Attention employs a simple, static attention mask in which each token attends to spatially nearby tokens, with the attention window size shrinking as temporal distance grows. Moreover, it allows pre-trained video diffusion models to extend their generation length with efficient LoRA-based fine-tuning. Extensive experiments show that Radial Attention maintains video quality across Wan2.1-14B, HunyuanVideo, and Mochi 1, achieving up to a 1.9× speedup over the original dense attention. With minimal tuning, it enables video generation up to 4× longer while reducing training costs by up to 4.4× compared to direct fine-tuning and accelerating inference by up to 3.7× compared to dense attention inference.
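The static mask described in the abstract can be illustrated with a small sketch. This is not the paper's exact mask or kernel: it assumes tokens are laid out frame by frame, treats the position inside a frame as one-dimensional, and halves the spatial window each time the temporal distance doubles. The function name `radial_style_mask` and both parameters are hypothetical, chosen only for illustration.

```python
import numpy as np


def radial_style_mask(num_frames: int, tokens_per_frame: int) -> np.ndarray:
    """Return an (n, n) boolean mask; True means the query (row) may attend to the key (column).

    Assumption (not taken from the paper): the spatial window radius is
    tokens_per_frame // 2**band, where band = floor(log2(temporal distance)).
    """
    # Frame index and in-frame position of every token in a frame-major layout.
    frame = np.repeat(np.arange(num_frames), tokens_per_frame)
    pos = np.tile(np.arange(tokens_per_frame), num_frames)

    dt = np.abs(frame[:, None] - frame[None, :])  # temporal distance between tokens
    ds = np.abs(pos[:, None] - pos[None, :])      # spatial distance (1-D simplification)

    # Integer band index: 0 for dt in {0, 1}, 1 for dt in {2, 3}, 2 for dt in {4..7}, ...
    band = np.zeros_like(dt)
    threshold = 2
    while threshold <= dt.max():
        band += dt >= threshold
        threshold *= 2

    # The window radius halves with each band, but never drops below 1,
    # so a token always sees the same spatial position in every frame.
    radius = np.maximum(tokens_per_frame // (2 ** band), 1)
    return ds < radius


if __name__ == "__main__":
    mask = radial_style_mask(num_frames=8, tokens_per_frame=8)
    # Fraction of query-key pairs that are kept, compared with 1.0 for dense attention.
    print("kept fraction:", mask.mean())
```

In this sketch, nearby frames keep full spatial attention while distant frames keep only a narrow band around the same spatial position, which is the intuition behind the exponentially decaying compute density. A practical implementation would apply such a mask inside a block-sparse attention kernel rather than materializing a dense boolean matrix.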
