The Pitfalls of KV Cache Compression
September 30, 2025
Authors: Alex Chen, Renato Geh, Aditya Grover, Guy Van den Broeck, Daniel Israel
cs.AI
Abstract
KV cache compression promises increased throughput and efficiency with negligible loss in performance. While the throughput gains are indisputable, and recent literature has indeed shown minimal degradation on particular benchmarks, the consequences of compression in realistic scenarios such as multi-instruction prompting have in general been insufficiently studied. In this paper, we identify several pitfalls that practitioners should be aware of when deploying KV-cache-compressed LLMs. Importantly, we show that certain instructions degrade much more rapidly under compression, effectively causing the LLM to ignore them entirely. As a practical example of this phenomenon, we present system prompt leakage as a case study, empirically showing the impact of compression on leakage and on general instruction following. We identify several factors that play a role in prompt leakage: the compression method, instruction order, and KV eviction bias. We then propose simple changes to KV cache eviction policies that reduce the impact of these factors and improve overall performance on multi-instruction tasks.
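
The abstract does not spell out the proposed change to eviction policies, but the failure mode it describes (instruction tokens being evicted and then ignored) can be illustrated with a minimal sketch of score-based KV eviction with a protection bias for instruction spans. The function `evict`, its parameters, and the H2O-style accumulated-attention scoring below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the paper's method): score-based KV cache eviction
# that keeps a fixed budget of entries by accumulated attention score,
# while never evicting a set of protected (e.g., instruction) tokens.
import numpy as np

def evict(keys, values, attn_scores, keep_budget, protected_idx=()):
    """Keep the `keep_budget` cache entries with the highest accumulated
    attention scores. Entries listed in `protected_idx` are biased so they
    are always retained, mitigating the instruction-dropping failure mode."""
    seq_len = keys.shape[0]
    if seq_len <= keep_budget:
        return keys, values, np.arange(seq_len)

    scores = attn_scores.copy()
    scores[list(protected_idx)] = np.inf     # bias: always retain these tokens
    keep = np.sort(np.argsort(scores)[-keep_budget:])  # top-k, in sequence order
    return keys[keep], values[keep], keep

# Toy usage: 8 cached tokens, budget of 4, tokens 0-1 (an instruction
# prefix) marked as protected.
rng = np.random.default_rng(0)
K = rng.normal(size=(8, 16))
V = rng.normal(size=(8, 16))
scores = rng.random(8)
K2, V2, kept = evict(K, V, scores, keep_budget=4, protected_idx=(0, 1))
print(kept)  # always contains indices 0 and 1
```

Without the protection bias, whether instruction tokens survive depends entirely on their attention scores, which is one way the eviction-bias factor named in the abstract can surface in practice.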