Radial Attention: O(n log n) Sparse Attention with Energy Decay for Long Video Generation
June 24, 2025
Authors: Xingyang Li, Muyang Li, Tianle Cai, Haocheng Xi, Shuo Yang, Yujun Lin, Lvmin Zhang, Songlin Yang, Jinbo Hu, Kelly Peng, Maneesh Agrawala, Ion Stoica, Kurt Keutzer, Song Han
cs.AI
Abstract
Recent advances in diffusion models have enabled high-quality video generation, but the additional temporal dimension significantly increases computational costs, making training and inference on long videos prohibitively expensive. In this paper, we identify a phenomenon we term Spatiotemporal Energy Decay in video diffusion models: post-softmax attention scores diminish as the spatial and temporal distance between tokens increases, akin to the physical decay of signals or waves over space and time in nature. Motivated by this, we propose Radial Attention, a scalable sparse attention mechanism with O(n log n) complexity that translates energy decay into exponentially decaying compute density, which is significantly more efficient than standard O(n^2) dense attention and more expressive than linear attention. Specifically, Radial Attention employs a simple, static attention mask in which each token attends to spatially nearby tokens, with the attention window size shrinking as temporal distance grows. Moreover, it allows pre-trained video diffusion models to extend their generation length with efficient LoRA-based fine-tuning. Extensive experiments show that Radial Attention maintains video quality across Wan2.1-14B, HunyuanVideo, and Mochi 1, achieving up to a 1.9× speedup over the original dense attention. With minimal tuning, it enables video generation up to 4× longer while reducing training costs by up to 4.4× compared to direct fine-tuning and accelerating inference by up to 3.7× compared to dense attention inference.
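
To make the masking idea in the abstract concrete, below is a minimal NumPy sketch of a static, radial-style attention mask in which the spatial window roughly halves each time the temporal distance doubles. The function name radial_attention_mask, the specific halving schedule, and all parameter values are illustrative assumptions, not the paper's actual mask layout or kernel implementation.

import numpy as np

def radial_attention_mask(num_frames, tokens_per_frame, base_window):
    """Build a static boolean attention mask in the spirit of Radial Attention.

    Tokens are laid out frame by frame. A query token in frame t_q at spatial
    index s_q may attend to a key token in frame t_k at spatial index s_k only
    if |s_q - s_k| lies within a spatial window whose width shrinks roughly by
    half each time the temporal distance |t_q - t_k| doubles (a hypothetical
    decay schedule chosen for illustration).
    """
    n = num_frames * tokens_per_frame
    mask = np.zeros((n, n), dtype=bool)
    for tq in range(num_frames):
        for tk in range(num_frames):
            dt = abs(tq - tk)
            # Spatial window width decays exponentially with temporal distance,
            # clamped to at least 1 so distant frames stay minimally connected.
            window = base_window if dt == 0 else max(1, base_window >> dt.bit_length())
            for sq in range(tokens_per_frame):
                q = tq * tokens_per_frame + sq
                lo = max(0, sq - window)
                hi = min(tokens_per_frame, sq + window + 1)
                k0 = tk * tokens_per_frame
                mask[q, k0 + lo:k0 + hi] = True
    return mask

if __name__ == "__main__":
    m = radial_attention_mask(num_frames=8, tokens_per_frame=16, base_window=4)
    print(f"mask shape: {m.shape}, attended fraction: {m.mean():.3f}")

Because the window halves as the temporal distance doubles, the number of attended key positions per query shrinks roughly geometrically with frame distance, so for a fixed base window the total attended pairs grow on the order of n log n rather than n^2, mirroring the exponentially decaying compute density described above.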