Radial Attention: O(n log n) Sparse Attention with Energy Decay for Long Video Generation
June 24, 2025
Authors: Xingyang Li, Muyang Li, Tianle Cai, Haocheng Xi, Shuo Yang, Yujun Lin, Lvmin Zhang, Songlin Yang, Jinbo Hu, Kelly Peng, Maneesh Agrawala, Ion Stoica, Kurt Keutzer, Song Han
cs.AI
Abstract
Recent advances in diffusion models have enabled high-quality video generation, but the additional temporal dimension significantly increases computational costs, making training and inference on long videos prohibitively expensive. In this paper, we identify a phenomenon we term Spatiotemporal Energy Decay in video diffusion models: post-softmax attention scores diminish as the spatial and temporal distance between tokens increases, akin to the physical decay of signals or waves over space and time in nature. Motivated by this, we propose Radial Attention, a scalable sparse attention mechanism with O(n log n) complexity that translates energy decay into exponentially decaying compute density, which is significantly more efficient than standard O(n^2) dense attention and more expressive than linear attention. Specifically, Radial Attention employs a simple, static attention mask where each token attends to spatially nearby tokens, with the attention window size shrinking with temporal distance. Moreover, it allows pre-trained video diffusion models to extend their generation length with efficient LoRA-based fine-tuning. Extensive experiments show that Radial Attention maintains video quality across Wan2.1-14B, HunyuanVideo, and Mochi 1, achieving up to a 1.9× speedup over the original dense attention. With minimal tuning, it enables video generation up to 4× longer while reducing training costs by up to 4.4× compared to direct fine-tuning and accelerating inference by up to 3.7× compared to dense attention.
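
The static, distance-dependent mask described in the abstract can be sketched concretely. The snippet below is a minimal illustration, not the paper's released implementation: it assumes a 1-D spatial layout per frame and a window that halves roughly every time the temporal distance doubles, so the per-query attention budget decays exponentially with temporal distance and the overall mask density is on the order of O(n log n). The function name `radial_style_mask` and the exact decay rule are illustrative assumptions.

```python
# Minimal sketch of a radial-style static attention mask (assumptions noted above).
import torch

def radial_style_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Return an (n, n) boolean mask; True means attention is allowed."""
    n = num_frames * tokens_per_frame
    idx = torch.arange(n)
    frame = idx // tokens_per_frame   # temporal index of each token
    pos = idx % tokens_per_frame      # spatial index within its frame (1-D stand-in)

    dt = (frame[:, None] - frame[None, :]).abs().float()  # temporal distance
    ds = (pos[:, None] - pos[None, :]).abs().float()      # spatial distance

    # Assumed decay rule: the spatial window halves each time the temporal
    # distance doubles, clamped to at least 1 so every frame pair keeps a
    # narrow diagonal band.
    band = torch.floor(torch.log2(dt + 1.0))
    window = (tokens_per_frame / 2.0 ** band).clamp(min=1.0)

    return ds < window

mask = radial_style_mask(num_frames=8, tokens_per_frame=16)
print(f"mask density: {mask.float().mean().item():.3f} (dense attention would be 1.0)")
```

Because the mask is static, it can be precomputed once and reused across diffusion steps; in practice such a pattern would be realized with a block-sparse attention kernel rather than a dense boolean matrix.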