IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse

Abstract

Long-context agentic workflows have emerged as a defining use case for large language models, making attention efficiency critical for both inference speed and serving cost. Sparse attention addresses this challenge effectively, and DeepSeek Sparse Attention (DSA) is a representative production-grade solution: a lightweight lightning indexer selects the top-k most relevant tokens per query, reducing core attention from O(L²) to O(Lk). However, the indexer itself retains O(L²) complexity and must run independently at every layer, even though the resulting top-k selections are highly similar across consecutive layers. We present IndexCache, which exploits this cross-layer redundancy by partitioning layers into a small set of Full layers that run their own indexers and a majority of Shared layers that simply reuse the nearest Full layer's top-k indices. We propose two complementary approaches to determine and optimize this configuration. Training-free IndexCache applies a greedy search algorithm that selects which layers retain their indexers by directly minimizing language modeling loss on a calibration set, requiring no weight updates. Training-aware IndexCache introduces a multi-layer distillation loss that trains each retained indexer against the averaged attention distributions of all layers it serves, enabling even simple interleaved patterns to match full-indexer accuracy. Experimental results on a 30B DSA model show that IndexCache can remove 75% of indexer computations with negligible quality degradation, achieving up to 1.82× prefill speedup and 1.48× decode speedup compared to standard DSA. These positive results are further confirmed by our preliminary experiments on the production-scale GLM-5 model (Figure 1).
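
The Full/Shared partition described above can be illustrated with a small pure-Python sketch. It is not the authors' implementation. The function names (`assign_sources`, `top_k_indices`, `run_layers`), the toy indexer, and the reading of "nearest Full layer" as the closest Full layer at or before the current layer are all assumptions made for illustration. The sketch only shows how Shared layers can reuse cached top-k indices and how the number of indexer calls drops as a result.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def assign_sources(num_layers: int, full_layers: Sequence[int]) -> List[int]:
    """Map every layer to the Full layer whose top-k indices it uses.

    Assumption: "nearest" means the closest Full layer at or before the
    current layer, so layer 0 has to be a Full layer in this sketch.
    """
    full = set(full_layers)
    if 0 not in full:
        raise ValueError("layer 0 must be a Full layer in this sketch")
    sources: List[int] = []
    current = 0
    for layer in range(num_layers):
        if layer in full:
            current = layer  # a Full layer serves itself
        sources.append(current)
    return sources


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Return the positions of the k highest scores, in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def run_layers(
    num_layers: int,
    full_layers: Sequence[int],
    indexer: Callable[[int], Sequence[float]],
    k: int,
) -> Tuple[List[List[int]], int]:
    """Return the per-layer token selections and the number of indexer calls."""
    sources = assign_sources(num_layers, full_layers)
    cache: Dict[int, List[int]] = {}
    selected: List[List[int]] = []
    calls = 0
    for layer, src in enumerate(sources):
        if layer == src:
            # Full layer: run the (hypothetical) indexer and cache its top-k.
            cache[layer] = top_k_indices(indexer(layer), k)
            calls += 1
        # Shared layers skip the indexer and reuse the cached indices.
        selected.append(cache[src])
    return selected, calls


def toy_indexer(layer: int) -> List[int]:
    """Stand-in for a real indexer: deterministic scores over 6 tokens."""
    return [((layer + 1) * (i + 3)) % 7 for i in range(6)]
```

As a usage example, `run_layers(8, [0, 4], toy_indexer, k=2)` calls the indexer at only 2 of 8 layers, which removes 1 - 2/8 = 75% of the indexer calls in this toy setup. Choosing which layers stay Full is the job of the greedy search or the training-aware variant, and neither is modeled here.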
