

SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space

November 25, 2025
Authors: Zhenyi Shen, Junru Lu, Lin Gui, Jiazheng Li, Yulan He, Di Yin, Xing Sun
cs.AI

Abstract

The quadratic complexity of full attention limits efficient long-context processing in large language models (LLMs). Sparse attention mitigates this cost by restricting each query to attend to a subset of previous tokens; however, training-free approaches often lead to severe performance degradation. Native sparse-attention methods (e.g., NSA, MoBA) alleviate this issue, yet exhibit a critical paradox: they produce lower attention sparsity than full-attention models, despite aiming to approximate full attention, which may constrain their effectiveness. We attribute this paradox to gradient update deficiency: low-ranked key-value pairs excluded during sparse training receive neither forward contribution nor backward gradients, and thus never learn proper suppression. To overcome this limitation, we propose SSA (Sparse Sparse Attention), a unified training framework that considers both sparse and full attention and enforces bidirectional alignment at every layer. This design preserves gradient flow to all tokens while explicitly encouraging sparse-attention outputs to align with their full-attention counterparts, thereby promoting stronger sparsity. As a result, SSA achieves state-of-the-art performance under both sparse and full attention inference across multiple commonsense benchmarks. Furthermore, SSA enables models to adapt smoothly to varying sparsity budgets; performance improves consistently as more tokens are allowed to attend, supporting flexible compute-performance trade-offs at inference time. Finally, we show that native sparse-attention training surprisingly improves long-context extrapolation by mitigating the over-allocation of attention values in sink areas, with SSA demonstrating the strongest extrapolation capability.
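The per-layer alignment idea described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' released implementation: the per-query top-k selection stands in for the paper's trainable sparse-attention mechanism, and the names `top_k`, `align_weight`, and the MSE alignment term are illustrative choices.

```python
# Minimal sketch: compute full and sparse attention outputs for one head and
# align them, so gradients still reach all key-value pairs via the full path.
# `top_k`, `align_weight`, and the MSE alignment loss are assumptions, not the
# paper's exact formulation.
import torch
import torch.nn.functional as F


def full_and_sparse_attention(q, k, v, top_k):
    """q, k, v: (seq_len, d) tensors; causal masking is applied to both paths."""
    d = q.size(-1)
    scores = (q @ k.T) / d**0.5                                  # (seq, seq)
    causal = torch.ones_like(scores).triu(1).bool()              # mask future tokens
    scores = scores.masked_fill(causal, float("-inf"))

    full_out = torch.softmax(scores, dim=-1) @ v                 # full-attention output

    # Sparse path: each query keeps only its top-k highest-scoring keys
    # (a stand-in for the trainable block selection in NSA/MoBA-style methods).
    kth = scores.topk(min(top_k, scores.size(-1)), dim=-1).values[..., -1:]
    sparse_scores = scores.masked_fill(scores < kth, float("-inf"))
    sparse_out = torch.softmax(sparse_scores, dim=-1) @ v

    return full_out, sparse_out


def alignment_loss(q, k, v, top_k=64, align_weight=1.0):
    """Per-layer alignment term pulling sparse and full outputs together.
    The usual language-modeling loss would be added elsewhere."""
    full_out, sparse_out = full_and_sparse_attention(q, k, v, top_k)
    return align_weight * F.mse_loss(sparse_out, full_out)
```

Because the alignment term is symmetric in the two outputs, gradients flow through both the sparse and the full branch at every layer, which is one plausible reading of the "bidirectional alignment" the abstract describes.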