HISA: Efficient Hierarchical Indexing for Fine-Grained Sparse Attention
March 30, 2026
Authors: Yufei Xu, Fanxu Meng, Fan Jiang, Yuxuan Wang, Ruijie Zhou, Jiexi Wu, Zhixin Pan, Zhaohui Wang, Xiaojuan Tang, Wenjie Pei, Tongxuan Liu, Di Yin, Xing Sun, Muhan Zhang
cs.AI
Abstract
Token-level sparse attention mechanisms, exemplified by DeepSeek Sparse Attention (DSA), achieve fine-grained key selection by scoring every historical token for each query with a lightweight indexer, then computing attention only over the selected subset. While the downstream sparse attention scales efficiently, the indexer still scans the entire prefix for every query, introducing an O(L^2) per-layer bottleneck that becomes prohibitive as context length grows. We propose HISA (Hierarchical Indexed Sparse Attention), a drop-in replacement for the indexer that transforms the search process from a flat token scan into a two-stage hierarchical procedure. First, a block-level coarse filter scores pooled block representatives to prune irrelevant regions. Then, a token-level refinement applies the original indexer only within the remaining candidate blocks. HISA preserves the exact token-level top-k sparsity pattern required by the downstream Sparse MLA operator and requires no additional training. On kernel-level benchmarks, HISA achieves a 2× speedup at 32K context length and a 4× speedup at 128K. On Needle-in-a-Haystack and LongBench, we directly replace the indexer in DeepSeek-V3.2 with HISA, without any fine-tuning. HISA closely matches the original DSA in quality while significantly outperforming block-sparse baselines. Moreover, the token selection sets produced by HISA and the original DSA exhibit a mean IoU greater than 99%, indicating that the efficiency gains come at virtually no cost in selection fidelity.
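The two-stage selection described above can be sketched as follows. This is a minimal illustrative NumPy sketch, not the paper's implementation: it assumes mean pooling for the block representatives and a plain dot product as a stand-in for the lightweight indexer, and the function name `hisa_select` and its parameters (`block_size`, `num_blocks_kept`, `top_k`) are hypothetical.

```python
import numpy as np

def hisa_select(q, keys, block_size=64, num_blocks_kept=8, top_k=128):
    """Two-stage hierarchical top-k token selection (illustrative sketch).

    Stage 1: score mean-pooled block representatives and keep only the
    highest-scoring blocks. Stage 2: score individual tokens inside the
    surviving blocks and return the global top-k token indices, so the
    downstream sparse attention still sees an exact token-level pattern.
    """
    L, d = keys.shape
    n_blocks = (L + block_size - 1) // block_size

    # Stage 1: coarse filter over pooled block representatives.
    # (Zero-padding slightly dilutes the last block's mean; fine for a sketch.)
    pad = n_blocks * block_size - L
    padded = np.vstack([keys, np.zeros((pad, d))]) if pad else keys
    reps = padded.reshape(n_blocks, block_size, d).mean(axis=1)
    block_scores = reps @ q
    kept_blocks = np.argsort(block_scores)[::-1][:num_blocks_kept]

    # Stage 2: token-level refinement within candidate blocks only.
    cand = np.concatenate([
        np.arange(b * block_size, min((b + 1) * block_size, L))
        for b in kept_blocks
    ])
    token_scores = keys[cand] @ q  # stand-in for the lightweight indexer
    order = np.argsort(token_scores)[::-1][:top_k]
    return np.sort(cand[order])
```

For the selection to remain exact with respect to a full scan, `top_k` must not exceed `num_blocks_kept * block_size`, and the coarse filter must not prune blocks containing true top-k tokens; the reported >99% IoU with the original DSA suggests this rarely happens in practice.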