Re-ttention: Ultra Sparse Visual Generation via Attention Statistical Reshape
May 28, 2025
Authors: Ruichen Chen, Keith G. Mills, Liyao Jiang, Chao Gao, Di Niu
cs.AI
Abstract
Diffusion Transformers (DiT) have become the de facto model for generating
high-quality visual content such as videos and images. A major bottleneck is
the attention mechanism, whose complexity scales quadratically with resolution
and video length. One logical way to lessen this burden is sparse attention,
where only a subset of tokens or patches is included in the calculation.
However, existing techniques fail to preserve visual quality at extremely high
sparsity levels and might even incur non-negligible compute overheads. To
address this concern, we propose Re-ttention, which implements very high
sparsity attention for visual generation models by leveraging the temporal
redundancy of diffusion models to overcome the probabilistic normalization
shift within the attention mechanism. Specifically, Re-ttention reshapes
attention scores based on the prior softmax distribution history in order to
preserve the visual quality of full quadratic attention at very high sparsity
levels. Experimental results on T2V/T2I models such as CogVideoX and the PixArt
DiTs demonstrate that Re-ttention requires as few as 3.1% of the tokens during
inference, outperforming contemporary methods like FastDiTAttn, Sparse VideoGen
and MInference. Further, we measure latency to show that our method can attain
over 45% end-to-end and over 92% self-attention latency reduction on an H100
GPU at negligible overhead cost.
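The normalization shift mentioned in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation; it is one plausible reading of the idea, with hypothetical function names and made-up scores. It assumes that when softmax is computed over only the kept tokens, their probabilities are inflated because the denominator is too small, and that a ratio cached from an earlier denoising step (where full attention was computed) can be reused to rescale them.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_softmax(scores, keep):
    """Softmax over the kept indices only; dropped positions get 0."""
    kept_probs = softmax([scores[i] for i in keep])
    out = [0.0] * len(scores)
    for i, p in zip(keep, kept_probs):
        out[i] = p
    return out

def kept_mass(scores, keep):
    """Share of the full softmax mass that falls on the kept tokens."""
    full = softmax(scores)
    return sum(full[i] for i in keep)

def rescaled_sparse_softmax(scores, keep, cached_ratio):
    """Sparse softmax scaled by a ratio cached from an earlier step (assumption)."""
    return [p * cached_ratio for p in sparse_softmax(scores, keep)]

def max_error_on_kept(approx, exact, keep):
    """Largest absolute probability error over the kept tokens."""
    return max(abs(approx[i] - exact[i]) for i in keep)

# Made-up attention scores for one query at two adjacent denoising steps.
scores_prev = [2.0, 1.0, 0.5, 0.0, -0.5, -1.0]   # step with full attention
scores_now = [2.1, 0.9, 0.5, 0.1, -0.5, -1.0]    # step with sparse attention
keep = [0, 1]                                     # tokens retained by the sparse mask

ratio = kept_mass(scores_prev, keep)              # cached at the full-attention step
full_now = softmax(scores_now)
plain = sparse_softmax(scores_now, keep)
rescaled = rescaled_sparse_softmax(scores_now, keep, ratio)

err_plain = max_error_on_kept(plain, full_now, keep)
err_rescaled = max_error_on_kept(rescaled, full_now, keep)
print(f"cached ratio: {ratio:.3f}")
print(f"max error without rescaling: {err_plain:.4f}")
print(f"max error with rescaling:    {err_rescaled:.4f}")
```

Because the scores change only slightly between adjacent steps in this toy setup, the cached ratio stays close to the true kept mass, so the rescaled probabilities track full attention more closely than the plain sparse softmax does.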
Code is available at https://github.com/cccrrrccc/Re-ttention