

KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction

May 29, 2025
Authors: Jang-Hyun Kim, Jinuk Kim, Sangwoo Kwon, Jae W. Lee, Sangdoo Yun, Hyun Oh Song
cs.AI

Abstract

Transformer-based large language models (LLMs) cache context as key-value (KV) pairs during inference. As context length grows, KV cache sizes expand, leading to substantial memory overhead and increased attention latency. This paper introduces KVzip, a query-agnostic KV cache eviction method enabling effective reuse of compressed KV caches across diverse queries. KVzip quantifies the importance of a KV pair using the underlying LLM to reconstruct original contexts from cached KV pairs, subsequently evicting pairs with lower importance. Extensive empirical evaluations demonstrate that KVzip reduces KV cache size by 3-4× and FlashAttention decoding latency by approximately 2×, with negligible performance loss in question-answering, retrieval, reasoning, and code comprehension tasks. Evaluations include various models such as LLaMA3.1-8B, Qwen2.5-14B, and Gemma3-12B, with context lengths reaching up to 170K tokens. KVzip significantly outperforms existing query-aware KV eviction methods, which suffer from performance degradation even at a 90% cache budget ratio under multi-query scenarios.
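The core mechanism the abstract describes, scoring each cached KV pair by how much the LLM relies on it when reconstructing the original context, then evicting the lowest-scoring pairs, can be sketched roughly as follows. This is a minimal PyTorch illustration, not the authors' implementation: the `recon_attn` input (attention weights assumed to be collected during a context-reconstruction pass) and the max-over-steps aggregation are assumptions made for the example.

```python
import torch

def kvzip_style_evict(key_cache, value_cache, recon_attn, keep_ratio=0.3):
    """Query-agnostic KV eviction sketch (illustrative, not the paper's code).

    key_cache, value_cache: [num_heads, seq_len, head_dim] cached KV pairs.
    recon_attn: [num_heads, recon_len, seq_len] attention weights gathered
        while the LLM reconstructs the original context from the cache
        (hypothetical input standing in for the paper's importance signal).
    keep_ratio: fraction of KV pairs to retain (e.g., ~0.3 for 3-4x compression).
    """
    # Importance of each cached KV pair: the maximum attention it receives
    # over all reconstruction steps (one simple aggregation choice).
    importance = recon_attn.amax(dim=1)               # [num_heads, seq_len]

    # Keep the top-k most important positions per head, preserving order.
    seq_len = key_cache.shape[1]
    k = max(1, int(seq_len * keep_ratio))
    keep_idx = importance.topk(k, dim=-1).indices.sort(dim=-1).values

    # Gather the retained KV pairs head by head.
    idx = keep_idx.unsqueeze(-1).expand(-1, -1, key_cache.shape[-1])
    return key_cache.gather(1, idx), value_cache.gather(1, idx)
```

Because the importance scores come from reconstructing the context itself rather than from any particular query, the same compressed cache can be reused across diverse downstream queries, which is the query-agnostic property the abstract emphasizes.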

