SLA2: Sparse-Linear Attention with Learnable Routing and QAT
February 13, 2026
Authors: Jintao Zhang, Haoxu Wang, Kai Jiang, Kaiwen Zheng, Youhe Jiang, Ion Stoica, Jianfei Chen, Jun Zhu, Joseph E. Gonzalez
cs.AI
Abstract
Sparse-Linear Attention (SLA) combines sparse and linear attention to accelerate diffusion models and has shown strong performance in video generation. However, (i) SLA relies on a heuristic split that assigns computations to the sparse or linear branch based on attention-weight magnitude, which can be suboptimal. Additionally, (ii) after formally analyzing the attention error in SLA, we identify a mismatch between SLA and a direct decomposition into sparse and linear attention. We propose SLA2, which introduces (I) a learnable router that dynamically selects whether each attention computation should use sparse or linear attention, (II) a more faithful and direct sparse-linear attention formulation that uses a learnable ratio to combine the sparse and linear attention branches, and (III) a sparse + low-bit attention design, where low-bit attention is introduced via quantization-aware fine-tuning to reduce quantization error. Experiments show that on video diffusion models, SLA2 can achieve 97% attention sparsity and deliver an 18.6x attention speedup while preserving generation quality.
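The abstract's core idea, routing each attention computation to an exact sparse branch or a cheap linear branch and combining them with a learnable ratio, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the learnable router is replaced by a simple top-k gate per query row, the learnable ratio `alpha` by a fixed scalar, and `feature_map` is one common positive feature map for linear attention; all of these names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def feature_map(x):
    # positive feature map for the linear branch (ELU+1 is one common choice)
    return np.where(x > 0, x + 1.0, np.exp(x))

def sparse_branch(Q, K, V, mask):
    # exact softmax attention restricted to the routed (query, key) pairs
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores = np.where(mask, scores, -1e9)  # mask out non-routed pairs
    return softmax(scores, axis=-1) @ V

def linear_branch(Q, K, V):
    # kernelized attention: O(n d^2) instead of O(n^2 d)
    Qf, Kf = feature_map(Q), feature_map(K)
    num = Qf @ (Kf.T @ V)
    den = Qf @ Kf.sum(axis=0)[:, None]
    return num / den

def topk_router(Q, K, k):
    # stand-in for SLA2's learnable router: keep the k largest scores per row
    scores = Q @ K.T
    idx = np.argsort(-scores, axis=-1)[:, :k]
    mask = np.zeros_like(scores, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return mask

n, d, k = 16, 8, 4
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
alpha = 0.1  # learnable in SLA2; a fixed scalar here for illustration
mask = topk_router(Q, K, k)
out = sparse_branch(Q, K, V, mask) + alpha * linear_branch(Q, K, V)
print(out.shape)  # (16, 8)
```

With k = 4 of 16 keys kept per query, the sparse branch here runs at 75% sparsity; the paper reports reaching 97% sparsity while the linear branch compensates for the dropped attention mass.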
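The quantization-aware fine-tuning mentioned in point (III) typically works by inserting fake quantization into the forward pass so the model adapts to low-bit error during training. The abstract does not specify SLA2's scheme, so the following is only a generic sketch of symmetric per-tensor fake quantization at 4 bits; the function name and bit width are illustrative assumptions.

```python
import numpy as np

def fake_quant(x, bits=4):
    # symmetric per-tensor fake quantization, as used in QAT forward passes;
    # in training, gradients would pass straight through this rounding step
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale  # dequantized values carrying the low-bit error

x = np.random.default_rng(1).standard_normal(1000)
xq = fake_quant(x, bits=4)
err = np.abs(x - xq).max()  # bounded by half a quantization step
```

Fine-tuning attention through such a quantizer lets the model absorb the rounding error, which is how QAT reduces the quality gap versus post-hoc quantization.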