The Pitfalls of KV Cache Compression
September 30, 2025
Authors: Alex Chen, Renato Geh, Aditya Grover, Guy Van den Broeck, Daniel Israel
cs.AI
Abstract
KV cache compression promises increased throughput and efficiency with negligible loss in performance. While the throughput gains are indisputable, and recent literature has indeed shown minimal degradation on particular benchmarks, the consequences of compression in realistic scenarios such as multi-instruction prompting have in general been insufficiently studied. In this paper, we identify several pitfalls that practitioners should be aware of when deploying KV-cache-compressed LLMs. Importantly, we show that certain instructions degrade much more rapidly under compression, effectively causing the LLM to ignore them entirely. As a practical example of this phenomenon, we present system prompt leakage as a case study, empirically showing the impact of compression on leakage and on general instruction following. We identify several factors that play a role in prompt leakage: the compression method, instruction order, and KV eviction bias. We then propose simple changes to KV cache eviction policies that reduce the impact of these factors and improve overall performance on multi-instruction tasks.
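
The abstract does not spell out the proposed change to eviction policies, but the failure mode it describes (instruction tokens being evicted and then ignored) can be illustrated with a minimal sketch of score-based KV eviction with a protection bias for instruction spans. The function `evict`, its parameters, and the H2O-style accumulated-attention scoring below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the paper's method): score-based KV cache eviction
# that keeps a fixed budget of entries by accumulated attention score,
# while never evicting a set of protected (e.g., instruction) tokens.
import numpy as np

def evict(keys, values, attn_scores, keep_budget, protected_idx=()):
    """Keep the `keep_budget` cache entries with the highest accumulated
    attention scores. Entries listed in `protected_idx` are biased so they
    are always retained, mitigating the instruction-dropping failure mode."""
    seq_len = keys.shape[0]
    if seq_len <= keep_budget:
        return keys, values, np.arange(seq_len)

    scores = attn_scores.copy()
    scores[list(protected_idx)] = np.inf     # bias: always retain these tokens
    keep = np.sort(np.argsort(scores)[-keep_budget:])  # top-k, in sequence order
    return keys[keep], values[keep], keep

# Toy usage: 8 cached tokens, budget of 4, tokens 0-1 (an instruction
# prefix) marked as protected.
rng = np.random.default_rng(0)
K = rng.normal(size=(8, 16))
V = rng.normal(size=(8, 16))
scores = rng.random(8)
K2, V2, kept = evict(K, V, scores, keep_budget=4, protected_idx=(0, 1))
print(kept)  # always contains indices 0 and 1
```

Without the protection bias, whether instruction tokens survive depends entirely on their attention scores, which is one way the eviction-bias factor named in the abstract can surface in practice.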