SVG-EAR: Parameter-Free Linear Compensation for Sparse Video Generation via Error-aware Routing

Abstract

Diffusion Transformers (DiTs) have become a leading backbone for video generation, yet their quadratic attention cost remains a major bottleneck. Sparse attention reduces this cost by computing only a subset of attention blocks. However, prior methods either drop the remaining blocks, which incurs information loss, or rely on learned predictors to approximate them, which introduces training overhead and potential shifts in the output distribution. In this paper, we show that the missing contributions can be recovered without training: after semantic clustering, keys and values within each block exhibit strong similarity and can be well summarized by a small set of cluster centroids. Based on this observation, we introduce SVG-EAR, a parameter-free linear compensation branch that uses the centroids to approximate skipped blocks and recover their contributions. While centroid compensation is accurate for most blocks, it can fail on a small subset. Standard sparsification typically selects blocks by attention scores, which indicate where the model places its attention mass but not where the approximation error would be largest. SVG-EAR therefore performs error-aware routing: a lightweight probe estimates the compensation error for each block, the blocks with the highest error-to-cost ratio are computed exactly, and the skipped blocks are compensated. We provide theoretical guarantees that relate attention reconstruction error to clustering quality, and empirically show that SVG-EAR improves the quality-efficiency trade-off and increases throughput at the same generation fidelity on video diffusion tasks. Overall, SVG-EAR achieves a clear Pareto improvement over prior approaches, with speedups of up to 1.77× and 1.93× while maintaining PSNRs of up to 29.759 and 31.043 on Wan2.2 and HunyuanVideo, respectively.
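The two ideas in the abstract, centroid compensation and error-aware routing, can be illustrated with a small NumPy sketch for a single query. This is only a toy under stated assumptions, not the authors' implementation: the function names, the single-query setup, the count-weighted centroid term, and the use of an exact (oracle) error in place of the paper's lightweight probe are all hypothetical choices made for illustration. All blocks have equal size here, so the error-to-cost ratio reduces to ranking by error.

```python
import numpy as np

def dense_attention(q, K, V):
    """Reference softmax attention for a single query vector q."""
    s = K @ q / np.sqrt(q.shape[0])
    w = np.exp(s - s.max())
    return (w[:, None] * V).sum(axis=0) / w.sum()

def compensated_attention(q, K, V, block_ids, exact_blocks):
    """Blocks in exact_blocks are computed exactly. Every other block is
    summarized by its key centroid and mean value, weighted by block size
    (an assumed form of the compensation branch)."""
    scale = 1.0 / np.sqrt(q.shape[0])
    logits, counts, values = [], [], []
    for b in np.unique(block_ids):
        idx = np.flatnonzero(block_ids == b)
        if int(b) in exact_blocks:
            logits.append(K[idx] @ q * scale)
            counts.append(np.ones(len(idx)))
            values.append(V[idx])
        else:
            logits.append(np.atleast_1d(K[idx].mean(axis=0) @ q * scale))
            counts.append(np.array([float(len(idx))]))
            values.append(V[idx].mean(axis=0, keepdims=True))
    logits = np.concatenate(logits)
    counts = np.concatenate(counts)
    values = np.concatenate(values)
    w = counts * np.exp(logits - logits.max())  # shared softmax normalizer
    return (w[:, None] * values).sum(axis=0) / w.sum()

def route_by_error(q, K, V, block_ids, budget):
    """Pick the `budget` blocks whose centroid approximation is worst.
    The error is computed exactly here (an oracle); a real system would
    estimate it cheaply instead."""
    scale = 1.0 / np.sqrt(q.shape[0])
    s = K @ q * scale
    shift = s.max()
    errors = {}
    for b in np.unique(block_ids):
        idx = np.flatnonzero(block_ids == b)
        exact = (np.exp(s[idx] - shift)[:, None] * V[idx]).sum(axis=0)
        s_bar = K[idx].mean(axis=0) @ q * scale
        approx = len(idx) * np.exp(s_bar - shift) * V[idx].mean(axis=0)
        errors[int(b)] = float(np.linalg.norm(exact - approx))
    ranked = sorted(errors, key=errors.get, reverse=True)
    return set(ranked[:budget])

# Toy data: three tight blocks (identical keys) and one noisy block.
rng = np.random.default_rng(0)
d, n = 16, 8
q = rng.normal(size=d)
K = np.repeat(rng.normal(size=(4, d)), n, axis=0)
K[3 * n:] += rng.normal(size=(n, d))  # block 3 is poorly clustered
V = rng.normal(size=(4 * n, d))
block_ids = np.repeat(np.arange(4), n)

chosen = route_by_error(q, K, V, block_ids, budget=1)
out = compensated_attention(q, K, V, block_ids, chosen)
```

In this toy setup the centroid term is exact for blocks whose keys coincide, so the routing budget is spent on the one block where the centroid is a poor summary. Ranking by attention score instead could spend that budget on a block the centroid already approximates well.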

