xKV: Cross-Layer SVD for KV-Cache Compression

Abstract

Large Language Models (LLMs) with long context windows enable powerful applications, but they come at the cost of high memory consumption to store the Key and Value states (KV-Cache). Recent studies have attempted to merge the KV-Cache from multiple layers into shared representations, yet these approaches either require expensive pretraining or rely on the assumption of high per-token cosine similarity across layers, which generally does not hold in practice. We find that the dominant singular vectors are remarkably well aligned across multiple layers of the KV-Cache. Exploiting this insight, we propose xKV, a simple post-training method that applies Singular Value Decomposition (SVD) to the KV-Cache of grouped layers. xKV consolidates the KV-Cache of multiple layers into a shared low-rank subspace, significantly reducing KV-Cache size. In extensive evaluations on the RULER long-context benchmark with widely used LLMs (e.g., Llama-3.1 and Qwen2.5), xKV achieves up to 6.8x higher compression rates than the state-of-the-art inter-layer technique while improving accuracy by 2.7%. Moreover, xKV is compatible with the emerging Multi-Head Latent Attention (MLA) (e.g., DeepSeek-Coder-V2), yielding a notable 3x compression rate on coding tasks without performance degradation. These results highlight xKV's strong capability and versatility in addressing memory bottlenecks for long-context LLM inference. Our code is publicly available at: https://github.com/abdelfattah-lab/xKV.
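The core idea, applying one SVD to the caches of a group of layers so that they share a low-rank subspace, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `compress_group` and `reconstruct`, the choice to concatenate the layer caches along the feature dimension, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def compress_group(layer_caches, rank):
    """Jointly factor the caches of several layers with one truncated SVD.

    layer_caches: list of arrays, each of shape (num_tokens, width).
    Returns a shared (num_tokens, rank) token basis and one small
    (rank, width) reconstruction matrix per layer.
    Hypothetical helper, not part of the xKV code base.
    """
    widths = [cache.shape[1] for cache in layer_caches]
    # Assumption: layers are concatenated along the feature dimension.
    stacked = np.concatenate(layer_caches, axis=1)
    u, s, vt = np.linalg.svd(stacked, full_matrices=False)
    # Keep the top-`rank` components; fold singular values into the basis.
    shared = u[:, :rank] * s[:rank]
    # Slice the right singular vectors back into per-layer blocks.
    per_layer = np.split(vt[:rank], np.cumsum(widths)[:-1], axis=1)
    return shared, per_layer


def reconstruct(shared, per_layer, layer_idx):
    """Approximate one layer's cache from the shared basis."""
    return shared @ per_layer[layer_idx]


# Synthetic caches that share a rank-8 subspace across 4 layers, plus noise.
rng = np.random.default_rng(0)
num_tokens, width, true_rank, num_layers = 256, 64, 8, 4
base = rng.standard_normal((num_tokens, true_rank))
caches = [
    base @ rng.standard_normal((true_rank, width))
    + 0.01 * rng.standard_normal((num_tokens, width))
    for _ in range(num_layers)
]

shared, per_layer = compress_group(caches, rank=true_rank)

# Relative reconstruction error for the first layer.
approx = reconstruct(shared, per_layer, 0)
rel_err = np.linalg.norm(approx - caches[0]) / np.linalg.norm(caches[0])

# Stored values before and after compression.
original_size = sum(cache.size for cache in caches)
compressed_size = shared.size + sum(block.size for block in per_layer)
print(f"relative error: {rel_err:.4f}")
print(f"compression ratio: {original_size / compressed_size:.1f}x")
```

The synthetic layers are built to share a subspace, so a single basis reconstructs all of them well. On real KV-Caches the achievable rank, and therefore the compression ratio, depends on how well the dominant singular vectors actually align across the grouped layers.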

