SLA2: Sparse-Linear Attention with Learnable Routing and QAT
February 13, 2026
Authors: Jintao Zhang, Haoxu Wang, Kai Jiang, Kaiwen Zheng, Youhe Jiang, Ion Stoica, Jianfei Chen, Jun Zhu, Joseph E. Gonzalez
cs.AI
Abstract
Sparse-Linear Attention (SLA) combines sparse and linear attention to accelerate diffusion models and has shown strong performance in video generation. However, (i) SLA relies on a heuristic split that assigns computations to the sparse or linear branch based on attention-weight magnitude, which can be suboptimal. Additionally, (ii) after formally analyzing the attention error in SLA, we identify a mismatch between SLA and a direct decomposition into sparse and linear attention. We propose SLA2, which introduces (I) a learnable router that dynamically selects whether each attention computation should use sparse or linear attention, (II) a more faithful and direct sparse-linear attention formulation that uses a learnable ratio to combine the sparse and linear attention branches, and (III) a sparse + low-bit attention design, where low-bit attention is introduced via quantization-aware fine-tuning to reduce quantization error. Experiments show that on video diffusion models, SLA2 can achieve 97% attention sparsity and deliver an 18.6x attention speedup while preserving generation quality.
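The abstract's core idea, routing each attention computation to an exact sparse branch or a cheap linear branch and combining them with a learnable ratio, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the learnable router is replaced by a simple top-k gate per query row, the learnable ratio `alpha` by a fixed scalar, and `feature_map` is one common positive feature map for linear attention; all of these names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def feature_map(x):
    # positive feature map for the linear branch (ELU+1 is one common choice)
    return np.where(x > 0, x + 1.0, np.exp(x))

def sparse_branch(Q, K, V, mask):
    # exact softmax attention restricted to the routed (query, key) pairs
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores = np.where(mask, scores, -1e9)  # mask out non-routed pairs
    return softmax(scores, axis=-1) @ V

def linear_branch(Q, K, V):
    # kernelized attention: O(n d^2) instead of O(n^2 d)
    Qf, Kf = feature_map(Q), feature_map(K)
    num = Qf @ (Kf.T @ V)
    den = Qf @ Kf.sum(axis=0)[:, None]
    return num / den

def topk_router(Q, K, k):
    # stand-in for SLA2's learnable router: keep the k largest scores per row
    scores = Q @ K.T
    idx = np.argsort(-scores, axis=-1)[:, :k]
    mask = np.zeros_like(scores, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return mask

n, d, k = 16, 8, 4
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
alpha = 0.1  # learnable in SLA2; a fixed scalar here for illustration
mask = topk_router(Q, K, k)
out = sparse_branch(Q, K, V, mask) + alpha * linear_branch(Q, K, V)
print(out.shape)  # (16, 8)
```

With k = 4 of 16 keys kept per query, the sparse branch here runs at 75% sparsity; the paper reports reaching 97% sparsity while the linear branch compensates for the dropped attention mass.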
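The quantization-aware fine-tuning mentioned in point (III) typically works by inserting fake quantization into the forward pass so the model adapts to low-bit error during training. The abstract does not specify SLA2's scheme, so the following is only a generic sketch of symmetric per-tensor fake quantization at 4 bits; the function name and bit width are illustrative assumptions.

```python
import numpy as np

def fake_quant(x, bits=4):
    # symmetric per-tensor fake quantization, as used in QAT forward passes;
    # in training, gradients would pass straight through this rounding step
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale  # dequantized values carrying the low-bit error

x = np.random.default_rng(1).standard_normal(1000)
xq = fake_quant(x, bits=4)
err = np.abs(x - xq).max()  # bounded by half a quantization step
```

Fine-tuning attention through such a quantizer lets the model absorb the rounding error, which is how QAT reduces the quality gap versus post-hoc quantization.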