Re-ttention: Ultra Sparse Visual Generation via Attention Statistical Reshape

Abstract

Diffusion Transformers (DiT) have become the de facto model for generating high-quality visual content such as videos and images. A major bottleneck is the attention mechanism, whose complexity scales quadratically with resolution and video length. One logical way to lessen this burden is sparse attention, where only a subset of tokens or patches is included in the calculation. However, existing techniques fail to preserve visual quality at extremely high sparsity levels and may even incur non-negligible compute overhead. To address this concern, we propose Re-ttention, which implements very high sparse attention for visual generation models by leveraging the temporal redundancy of Diffusion Models to overcome the probabilistic normalization shift within the attention mechanism. Specifically, Re-ttention reshapes attention scores based on the prior softmax distribution history in order to preserve the visual quality of full quadratic attention at very high sparsity levels. Experimental results on T2V/T2I models such as CogVideoX and the PixArt DiTs demonstrate that Re-ttention requires as few as 3.1% of the tokens during inference, outperforming contemporary methods like FastDiTAttn, Sparse VideoGen, and MInference. Further, we measure latency to show that our method can attain over 45% end-to-end and over 92% self-attention latency reduction on an H100 GPU at negligible overhead cost. Code is available at https://github.com/cccrrrccc/Re-ttention
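The "normalization shift" mentioned in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation. The function names, the toy scores, and the rescaling by a kept-mass ratio taken from a previous denoising step are all assumptions made for illustration. It shows that a softmax restricted to a few kept tokens inflates their weights, and that a ratio remembered from a nearby step can bring them back close to the full-attention values.

```python
import math

def softmax_weights(scores, keep=None):
    """Softmax over `scores`; if `keep` is given, only those indices take part."""
    idx = range(len(scores)) if keep is None else keep
    m = max(scores[i] for i in idx)              # subtract max for numerical stability
    exps = {i: math.exp(scores[i] - m) for i in idx}
    z = sum(exps.values())                       # softmax denominator
    return {i: e / z for i, e in exps.items()}

def kept_mass(scores, keep):
    """Share of the full softmax probability mass that lies on the kept indices."""
    full = softmax_weights(scores)
    return sum(full[i] for i in keep)

# Hypothetical attention scores for one query at two adjacent denoising steps.
prev_scores = [1.9, 1.6, 0.4, 0.0, -0.1, -0.6, -0.9, -1.3]
curr_scores = [2.0, 1.5, 0.3, 0.1, -0.2, -0.5, -1.0, -1.2]
keep = [0, 1]                                    # sparse attention keeps only two tokens

full = softmax_weights(curr_scores)              # reference: dense attention weights
sparse = softmax_weights(curr_scores, keep)      # renormalized over kept tokens only

# Assumption: a ratio measured at the previous step is reused at the current one,
# relying on scores changing slowly between steps.
rho_prev = kept_mass(prev_scores, keep)
rescaled = {i: w * rho_prev for i, w in sparse.items()}

for i in keep:
    print(i, round(full[i], 4), round(sparse[i], 4), round(rescaled[i], 4))
```

In this toy setting the rescaled weights no longer sum to one, which is intentional: the missing mass stands in for the tokens that sparse attention skipped.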
