xKV: Cross-Layer SVD for KV-Cache Compression
March 24, 2025
Authors: Chi-Chih Chang, Chien-Yu Lin, Yash Akhauri, Wei-Cheng Lin, Kai-Chiang Wu, Luis Ceze, Mohamed S. Abdelfattah
cs.AI
Abstract
Large Language Models (LLMs) with long context windows enable powerful
applications but come at the cost of high memory consumption to store the Key
and Value states (KV-Cache). Recent studies have attempted to merge the
KV-Cache from multiple layers into shared representations, yet these
approaches either require expensive pretraining or rely on the assumption of
high per-token cosine similarity across layers, which generally does not hold
in practice. We find
that the dominant singular vectors are remarkably well-aligned across multiple
layers of the KV-Cache. Exploiting this insight, we propose xKV, a simple
post-training method that applies Singular Value Decomposition (SVD) on the
KV-Cache of grouped layers. xKV consolidates the KV-Cache of multiple layers
into a shared low-rank subspace, significantly reducing KV-Cache sizes. Through
extensive evaluations on the RULER long-context benchmark with widely-used LLMs
(e.g., Llama-3.1 and Qwen2.5), xKV achieves up to 6.8x higher compression rates
than the state-of-the-art inter-layer technique while improving accuracy by 2.7%.
Moreover, xKV is compatible with the emerging Multi-Head Latent Attention (MLA)
(e.g., DeepSeek-Coder-V2), yielding a notable 3x compression rate on coding
tasks without performance degradation. These results highlight xKV's strong
capability and versatility in addressing memory bottlenecks for long-context
LLM inference. Our code is publicly available at:
https://github.com/abdelfattah-lab/xKV.
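
The core mechanism the abstract describes, grouping several layers' KV-Cache and factoring them through a single truncated SVD so they share one low-rank basis, can be sketched in a few lines. The sketch below is illustrative only and is not the authors' implementation (see the linked repository for that): the layer grouping, the rank, and the helper names (`cross_layer_svd_compress`, `reconstruct`) are hypothetical, and it assumes each layer's cache has already been flattened to a (tokens x hidden) matrix.

```python
# Minimal sketch of cross-layer SVD compression in the spirit of xKV.
# NOT the authors' implementation; the grouping, rank, and helper names
# here are illustrative assumptions.
import torch

def cross_layer_svd_compress(layer_caches: list[torch.Tensor], rank: int):
    """Factor a group of layers' caches through one shared low-rank basis.

    layer_caches: per-layer key (or value) caches, each of shape
                  (num_tokens, hidden_dim), flattened across heads.
    Returns the shared basis V_r (hidden_dim, rank) and one
    (num_tokens, rank) coefficient matrix per layer.
    """
    # Stacking the group's caches along the token axis lets a single SVD
    # find directions shared by all layers -- this is where the observed
    # alignment of dominant singular vectors across layers pays off.
    stacked = torch.cat(layer_caches, dim=0)          # (L * num_tokens, hidden_dim)
    _, _, Vh = torch.linalg.svd(stacked, full_matrices=False)
    V_r = Vh[:rank].T                                 # shared basis: (hidden_dim, rank)
    # Each layer now stores only its coordinates in the shared subspace.
    coeffs = [cache @ V_r for cache in layer_caches]  # (num_tokens, rank) each
    return V_r, coeffs

def reconstruct(V_r: torch.Tensor, coeff: torch.Tensor) -> torch.Tensor:
    # Low-rank approximation of one layer's original cache.
    return coeff @ V_r.T

# Toy usage: 4 grouped layers, 1024 tokens, hidden size 512, rank 64.
layers, tokens, dim, rank = 4, 1024, 512, 64
caches = [torch.randn(tokens, dim) for _ in range(layers)]
V_r, coeffs = cross_layer_svd_compress(caches, rank)
orig_size = layers * tokens * dim
comp_size = dim * rank + layers * tokens * rank
print(f"compression ratio: {orig_size / comp_size:.2f}x")
# Random Gaussian caches are nearly full-rank, so this error overstates
# what real (much more compressible) KV-Caches would show.
err = torch.linalg.norm(reconstruct(V_r, coeffs[0]) - caches[0]) / torch.linalg.norm(caches[0])
print(f"layer-0 relative reconstruction error: {err:.3f}")
```

Under these assumptions, each token's per-layer storage drops from hidden_dim to rank floats, with one (hidden_dim x rank) basis amortized over the whole group; that trade is where compression rates like those reported above would come from.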