SLA2: Sparse-Linear Attention with Learnable Routing and QAT

Abstract

Sparse-Linear Attention (SLA) combines sparse and linear attention to accelerate diffusion models and has shown strong performance in video generation. However, (i) SLA relies on a heuristic split that assigns computations to the sparse or linear branch based on attention-weight magnitude, which can be suboptimal. Additionally, (ii) a formal analysis of the attention error in SLA reveals a mismatch between SLA and a direct decomposition into sparse and linear attention. We propose SLA2, which introduces (I) a learnable router that dynamically selects whether each attention computation should use sparse or linear attention, (II) a more faithful and direct sparse-linear attention formulation that uses a learnable ratio to combine the sparse and linear attention branches, and (III) a sparse + low-bit attention design, in which low-bit attention is introduced via quantization-aware fine-tuning to reduce quantization error. Experiments show that on video diffusion models, SLA2 can achieve 97% attention sparsity and deliver an 18.6x attention speedup while preserving generation quality.
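
To make the idea of routing plus a learnable mixing ratio concrete, here is a minimal NumPy sketch. It is not the SLA2 implementation: the abstract does not specify the router architecture, the linear-attention feature map, or the exact form of the ratio, so everything below is an assumption chosen for illustration. The names `route_topk`, `w_q`, `w_k`, `alpha_logit`, and `keep` are hypothetical, the router is a plain top-k over learned pair scores, the feature map is elu(x) + 1, and the two branches are mixed as alpha * sparse + (1 - alpha) * linear. The sketch works at token level and forms dense N x N matrices for readability, which a real kernel would avoid, and it omits the low-bit (quantized) part entirely.

```python
import numpy as np

def route_topk(q, k, w_q, w_k, keep):
    """Hypothetical router: boolean mask, True = pair goes to the sparse branch."""
    r = (q @ w_q) @ (k @ w_k).T                  # learned pair scores, shape (N, N)
    idx = np.argpartition(-r, keep - 1, axis=-1)[:, :keep]  # top-`keep` keys per query
    mask = np.zeros(r.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return mask

def sparse_branch(q, k, v, mask):
    """Softmax attention restricted to the pairs selected by the router."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)     # unselected pairs get zero weight
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def linear_branch(q, k, v, mask, eps=1e-9):
    """Linear attention over the pairs NOT selected by the router (assumed elu+1 map)."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))
    w = (phi(q) @ phi(k).T) * ~mask              # dense here only for clarity
    return (w / (w.sum(axis=-1, keepdims=True) + eps)) @ v

def sparse_linear_attention(q, k, v, w_q, w_k, alpha_logit, keep):
    """Mix both branches with a ratio alpha = sigmoid(alpha_logit) in (0, 1)."""
    mask = route_topk(q, k, w_q, w_k, keep)
    alpha = 1.0 / (1.0 + np.exp(-alpha_logit))   # assumed form of the learnable ratio
    return alpha * sparse_branch(q, k, v, mask) + (1.0 - alpha) * linear_branch(q, k, v, mask)

# Toy usage with random data: 8 tokens, head dimension 4, 2 keys per query kept sparse.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 4)) for _ in range(3))
w_q, w_k = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
out = sparse_linear_attention(q, k, v, w_q, w_k, alpha_logit=0.0, keep=2)
```

In this toy, the fraction of pairs left out of the sparse branch is 1 - keep / N, which plays the role of the attention sparsity quoted in the abstract. In training, `w_q`, `w_k`, and `alpha_logit` would be the learned parameters; a hard top-k is not differentiable, so a trainable router would need some relaxation that this sketch does not attempt.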
