Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing

Abstract

Serving transformer language models at high throughput requires caching key-value (KV) states to avoid redundant computation during autoregressive generation. The memory footprint of the KV cache is significant and heavily impacts serving costs. This work aims to reduce these memory requirements. While recent work has largely addressed KV cache reduction via compression and eviction along the temporal axis, we argue that the depth dimension offers an orthogonal and robust avenue for optimization. Although prior research suggests that keeping a full cache for every layer is redundant, implementing cross-layer cache sharing remains a practical challenge: existing methods typically suffer from reduced throughput or increased time-to-first-token. In this paper, we demonstrate that dropping a layer's cache is an efficient optimization that incurs no loss of information. We propose a simple training approach, random cross-layer attention: during training, each layer randomly chooses to attend either to its own KV states or to those of a preceding layer. This stochastic process makes the model robust to a range of depth-wise cache sharing strategies, which provides flexibility when hardware constraints are unknown until deployment time. Our evaluations show that applying this scheme during pre-training or fine-tuning enables depth-wise cache sharing across various model families. Furthermore, for larger models in data-constrained settings, the approach appears to have a regularization-like effect, frequently preserving or improving performance while significantly reducing the cache's memory footprint.
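
The random routing step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `share_prob` parameter, and the rule that a sharing layer reuses whatever KV source the layer directly below it reads are all assumptions made for illustration. The abstract only states that each layer randomly attends either to its own KV states or to those of a preceding layer.

```python
import random


def sample_kv_routing(num_layers, share_prob, rng):
    """Sample one routing plan: for each layer, the index of the layer
    whose KV states it attends to (hypothetical helper).

    Assumption: a layer that shares reuses the KV source of the layer
    directly below it; the first layer always uses its own KV states.
    """
    sources = []
    for layer in range(num_layers):
        if layer > 0 and rng.random() < share_prob:
            # Attend to the KV states already used by the previous layer.
            sources.append(sources[layer - 1])
        else:
            # Attend to this layer's own KV states.
            sources.append(layer)
    return sources


def cache_fraction(sources):
    """Fraction of layers that need to store their own KV cache."""
    return len(set(sources)) / len(sources)


if __name__ == "__main__":
    rng = random.Random(0)
    # A fresh plan would be drawn at every training step.
    plan = sample_kv_routing(num_layers=8, share_prob=0.5, rng=rng)
    print(plan, cache_fraction(plan))
```

Under these assumptions, a new plan is sampled at each training step, while at deployment a single fixed plan could be chosen to match the available memory budget; only the layers that appear as a source in that plan would need to keep a KV cache.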
