KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction
May 29, 2025
作者: Jang-Hyun Kim, Jinuk Kim, Sangwoo Kwon, Jae W. Lee, Sangdoo Yun, Hyun Oh Song
cs.AI
Abstract
Transformer-based large language models (LLMs) cache context as key-value
(KV) pairs during inference. As context length grows, KV cache sizes expand,
leading to substantial memory overhead and increased attention latency. This
paper introduces KVzip, a query-agnostic KV cache eviction method enabling
effective reuse of compressed KV caches across diverse queries. KVzip
quantifies the importance of a KV pair using the underlying LLM to reconstruct
original contexts from cached KV pairs, subsequently evicting pairs with lower
importance. Extensive empirical evaluations demonstrate that KVzip reduces KV
cache size by 3-4× and FlashAttention decoding latency by approximately
2×, with negligible performance loss in question-answering, retrieval,
reasoning, and code comprehension tasks. Evaluations include various models
such as LLaMA3.1-8B, Qwen2.5-14B, and Gemma3-12B, with context lengths reaching
up to 170K tokens. KVzip significantly outperforms existing query-aware KV
eviction methods, which suffer from performance degradation even at a 90% cache
budget ratio under multi-query scenarios.
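The abstract describes the core mechanism only at a high level: the underlying LLM is asked to reconstruct the cached context, KV pairs are scored by how much they matter for that reconstruction, and low-scoring pairs are evicted. The sketch below illustrates one way this idea could look in code; it is not the paper's implementation. The model name, the "repeat the context" prompt, the use of maximum received attention as the importance score, and the keep_ratio parameter are all assumptions made for illustration.

```python
# Illustrative sketch of query-agnostic KV eviction via context reconstruction,
# loosely following the abstract's description of KVzip. Assumptions (not taken
# from the paper): a HuggingFace-style causal LM, attention weights as the
# importance proxy, and a simple "repeat the context" reconstruction prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # hypothetical choice; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16,
    attn_implementation="eager",  # eager attention exposes attention weights
)

@torch.no_grad()
def score_kv_importance(context: str):
    """Return (past_key_values, importance), where importance[l][h, t] is the
    maximum attention that cached token t receives in layer l, head h, while
    the model is prompted to reconstruct the context it has just cached."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    # 1) Prefill: cache the context as KV pairs.
    prefill = model(ctx_ids, use_cache=True)
    past = prefill.past_key_values

    # 2) Reconstruction pass: ask the model to repeat the context and feed the
    #    context tokens back in as targets, reusing the cached KV pairs.
    prompt_ids = tokenizer("\nRepeat the previous context verbatim:\n",
                           return_tensors="pt", add_special_tokens=False).input_ids
    recon_ids = torch.cat([prompt_ids, ctx_ids], dim=1)
    out = model(recon_ids, past_key_values=past, use_cache=True,
                output_attentions=True)

    ctx_len = ctx_ids.shape[1]
    importance = []
    for attn in out.attentions:             # one tensor per layer
        # attn: (batch, query_heads, recon_len, ctx_len + recon_len)
        received = attn[0, :, :, :ctx_len]  # attention paid to cached context KV
        importance.append(received.max(dim=1).values)  # max over reconstruction steps
    # Note: with grouped-query attention, query heads would still need to be
    # mapped onto the (fewer) KV heads before eviction; omitted here.
    return past, importance

def evict_mask(importance, keep_ratio: float = 0.3):
    """Per layer and head, keep the top `keep_ratio` fraction of cached positions."""
    masks = []
    for imp in importance:                  # imp: (heads, ctx_len)
        k = max(1, int(keep_ratio * imp.shape[-1]))
        thresh = imp.topk(k, dim=-1).values[..., -1:]
        masks.append(imp >= thresh)         # True = keep this KV pair
    return masks
```

Because the scores are computed from the reconstruction pass alone, with no user query involved, the same pruned cache can in principle be reused across arbitrary downstream queries, which is what "query-agnostic" means in the abstract.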