SLA2: Sparse-Linear Attention with Learnable Routing and QAT
February 13, 2026
Authors: Jintao Zhang, Haoxu Wang, Kai Jiang, Kaiwen Zheng, Youhe Jiang, Ion Stoica, Jianfei Chen, Jun Zhu, Joseph E. Gonzalez
cs.AI
Abstract
Sparse-Linear Attention (SLA) combines sparse and linear attention to accelerate diffusion models and has shown strong performance in video generation. However, SLA has two limitations: (i) it relies on a heuristic split that assigns computations to the sparse or linear branch based on attention-weight magnitude, which can be suboptimal; and (ii) a formal analysis of SLA's attention error reveals a mismatch between its formulation and a direct decomposition into sparse and linear attention. We propose SLA2, which introduces (I) a learnable router that dynamically selects whether each attention computation should use sparse or linear attention, (II) a more faithful and direct sparse-linear attention formulation that uses a learnable ratio to combine the sparse and linear attention branches, and (III) a sparse + low-bit attention design, where low-bit attention is introduced via quantization-aware fine-tuning to reduce quantization error. Experiments show that on video diffusion models, SLA2 achieves 97% attention sparsity and delivers an 18.6x attention speedup while preserving generation quality.
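The combination in (II) — a learnable ratio mixing a sparse branch and a linear branch — can be sketched as follows. The paper's implementation is not reproduced here, so the function names, the elu+1 feature map in the linear branch, the boolean routing mask, and the scalar ratio `alpha` are all illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_attention(q, k, v, mask):
    # mask: (Tq, Tk) boolean; True marks positions routed to the sparse branch
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    p = np.nan_to_num(softmax(scores, axis=-1))  # fully masked rows contribute zero
    return p @ v

def linear_attention(q, k, v):
    # feature map phi(x) = elu(x) + 1, a common linear-attention choice (assumption)
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))
    qf, kf = phi(q), phi(k)
    kv = kf.T @ v              # (d, d_v) summary: O(T d^2) rather than O(T^2 d)
    z = qf @ kf.sum(axis=0)    # per-query normalizer
    return (qf @ kv) / z[:, None]

def sla2_attention(q, k, v, mask, alpha):
    # learnable scalar ratio alpha in [0, 1] blends the two branches
    return alpha * sparse_attention(q, k, v, mask) + (1 - alpha) * linear_attention(q, k, v)
```

In SLA2 the mask would come from the learnable router rather than a fixed pattern, and `alpha` would be trained jointly with the model; here both are supplied by the caller for illustration.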