IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse
March 12, 2026
Authors: Yushi Bai, Qian Dong, Ting Jiang, Xin Lv, Zhengxiao Du, Aohan Zeng, Jie Tang, Juanzi Li
cs.AI
Abstract
Long-context agentic workflows have emerged as a defining use case for large language models, making attention efficiency critical for both inference speed and serving cost. Sparse attention addresses this challenge effectively, and DeepSeek Sparse Attention (DSA) is a representative production-grade solution: a lightweight lightning indexer selects the top-k most relevant tokens per query, reducing core attention from O(L²) to O(Lk). However, the indexer itself retains O(L²) complexity and must run independently at every layer, despite the fact that the resulting top-k selections are highly similar across consecutive layers. We present IndexCache, which exploits this cross-layer redundancy by partitioning layers into a small set of Full layers that run their own indexers and a majority of Shared layers that simply reuse the nearest Full layer's top-k indices. We propose two complementary approaches to determine and optimize this configuration. Training-free IndexCache applies a greedy search algorithm that selects which layers to retain indexers by directly minimizing language modeling loss on a calibration set, requiring no weight updates. Training-aware IndexCache introduces a multi-layer distillation loss that trains each retained indexer against the averaged attention distributions of all layers it serves, enabling even simple interleaved patterns to match full-indexer accuracy. Experimental results on a 30B DSA model show that IndexCache can remove 75% of indexer computations with negligible quality degradation, achieving up to 1.82× prefill speedup and 1.48× decode speedup compared to standard DSA. These positive results are further confirmed by our preliminary experiments on the production-scale GLM-5 model (Figure 1).
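The core reuse mechanism described above can be sketched in a few lines. This is a minimal, hypothetical illustration (not the paper's implementation): per-layer indexer scores are assumed to be given as arrays, each Shared layer is mapped to the nearest preceding Full layer, and `sparse_attend`, `make_layer_map`, and `full_layers` are illustrative names.

```python
import numpy as np

def make_layer_map(num_layers, full_layers):
    """Map each layer to the Full layer whose cached top-k indices it
    reuses (here: the nearest preceding Full layer). Hypothetical schedule."""
    full = set(full_layers)
    assert 0 in full, "layer 0 must run its own indexer"
    mapping, last = {}, 0
    for layer in range(num_layers):
        if layer in full:
            last = layer
        mapping[layer] = last
    return mapping

def sparse_attend(layer_scores, full_layers, k):
    """Run the top-k indexer only at Full layers; Shared layers reuse the
    cached indices and skip their own O(L^2) indexer pass entirely."""
    mapping = make_layer_map(len(layer_scores), full_layers)
    cache, selections = {}, []
    for layer, scores in enumerate(layer_scores):
        src = mapping[layer]
        if src == layer:  # Full layer: compute and cache its top-k indices
            cache[layer] = np.argpartition(-scores, k)[:k]
        selections.append(cache[src])  # Shared layer: free index lookup
    return selections
```

With 8 layers and `full_layers=[0, 4]`, only 2 of 8 indexer passes run (a 75% reduction), matching the interleaved-pattern setting the abstract reports; layers 1-3 reuse layer 0's indices and layers 5-7 reuse layer 4's.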