

IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse

March 12, 2026
作者: Yushi Bai, Qian Dong, Ting Jiang, Xin Lv, Zhengxiao Du, Aohan Zeng, Jie Tang, Juanzi Li
cs.AI

Abstract

Long-context agentic workflows have emerged as a defining use case for large language models, making attention efficiency critical for both inference speed and serving cost. Sparse attention addresses this challenge effectively, and DeepSeek Sparse Attention (DSA) is a representative production-grade solution: a lightweight lightning indexer selects the top-k most relevant tokens per query, reducing core attention from O(L²) to O(Lk). However, the indexer itself retains O(L²) complexity and must run independently at every layer, despite the fact that the resulting top-k selections are highly similar across consecutive layers. We present IndexCache, which exploits this cross-layer redundancy by partitioning layers into a small set of Full layers that run their own indexers and a majority of Shared layers that simply reuse the nearest Full layer's top-k indices. We propose two complementary approaches to determine and optimize this configuration. Training-free IndexCache applies a greedy search algorithm that selects which layers retain indexers by directly minimizing language modeling loss on a calibration set, requiring no weight updates. Training-aware IndexCache introduces a multi-layer distillation loss that trains each retained indexer against the averaged attention distributions of all layers it serves, enabling even simple interleaved patterns to match full-indexer accuracy. Experimental results on a 30B DSA model show that IndexCache can remove 75% of indexer computations with negligible quality degradation, achieving up to 1.82× prefill speedup and 1.48× decode speedup compared to standard DSA. These positive results are further confirmed by our preliminary experiments on the production-scale GLM-5 model (Figure 1).
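The core mechanism described above — Full layers computing their own top-k indexer selections while Shared layers reuse the nearest preceding Full layer's indices — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation; the function names (`topk_indices`, `select_layer_indices`) and the dictionary-of-scores interface are hypothetical, and the actual lightning indexer's scoring is abstracted away as precomputed per-layer relevance scores.

```python
import numpy as np

def topk_indices(scores, k):
    # Return indices of the k highest-scoring tokens per query position
    # (unsorted within the top-k set, which suffices for sparse attention).
    return np.argpartition(scores, -k, axis=-1)[..., -k:]

def select_layer_indices(num_layers, full_layers, indexer_scores, k):
    """Hypothetical sketch of cross-layer index reuse.

    full_layers   : set of layer ids that run their own indexer (must
                    include layer 0 so every layer has a cache to reuse).
    indexer_scores: dict mapping layer id -> (num_queries, seq_len) array
                    of indexer relevance scores; only consulted for Full
                    layers, mimicking the skipped indexer computation.
    Returns a dict mapping every layer id to its top-k token indices.
    """
    assert 0 in full_layers, "first layer must run an indexer"
    cached = None
    selected = {}
    for layer in range(num_layers):
        if layer in full_layers:
            # Full layer: run (here, read) its own indexer and refresh the cache.
            cached = topk_indices(indexer_scores[layer], k)
        # Shared layers fall through and reuse the most recent Full layer's indices.
        selected[layer] = cached
    return selected
```

With, say, Full layers {0, 2} in a 4-layer model, layers 1 and 3 incur no indexer cost at all: the O(L²) score computation happens only at layers 0 and 2, matching the paper's claim of removing the indexer work at Shared layers.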
March 15, 2026