

A Simple and Effective L_2 Norm-Based Strategy for KV Cache Compression

June 17, 2024
作者: Alessio Devoto, Yu Zhao, Simone Scardapane, Pasquale Minervini
cs.AI

Abstract

The deployment of large language models (LLMs) is often hindered by the extensive memory requirements of the Key-Value (KV) cache, especially as context lengths increase. Existing approaches to reduce the KV cache size involve either fine-tuning the model to learn a compression strategy or leveraging attention scores to reduce the sequence length. We analyse the attention distributions in decoder-only Transformers-based models and observe that attention allocation patterns stay consistent across most layers. Surprisingly, we find a clear correlation between the L_2 and the attention scores over cached KV pairs, where a low L_2 of a key embedding usually leads to a high attention score during decoding. This finding indicates that the influence of a KV pair is potentially determined by the key embedding itself before being queried. Based on this observation, we compress the KV cache based on the L_2 of key embeddings. Our experimental results show that this simple strategy can reduce the KV cache size by 50% on language modelling and needle-in-a-haystack tasks and 90% on passkey retrieval tasks without losing accuracy.
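As a rough illustration of the strategy described in the abstract, the sketch below prunes a KV cache by ranking cached keys by the L_2 norm of their embeddings and retaining only the lowest-norm entries. This is a minimal PyTorch sketch, not the authors' implementation: the tensor layout, the `compress_kv_cache` function name, and the `compression_ratio` argument are assumptions made for the example.

```python
# Minimal sketch (assumed interface, not the authors' code): keep the KV pairs
# whose key embeddings have the smallest L2 norm, following the paper's
# observation that low-norm keys tend to receive high attention scores.
import torch

def compress_kv_cache(keys: torch.Tensor,
                      values: torch.Tensor,
                      compression_ratio: float = 0.5):
    """Keep the (1 - compression_ratio) fraction of cached KV pairs whose
    key embeddings have the lowest L2 norm.

    keys, values: [batch, num_heads, seq_len, head_dim]  (assumed layout)
    """
    batch, num_heads, seq_len, head_dim = keys.shape
    num_keep = max(1, int(seq_len * (1.0 - compression_ratio)))

    # L2 norm of each cached key embedding: [batch, num_heads, seq_len]
    key_norms = keys.norm(p=2, dim=-1)

    # Indices of the num_keep lowest-norm keys per head, kept in sequence order.
    keep_idx = key_norms.topk(num_keep, dim=-1, largest=False).indices
    keep_idx, _ = keep_idx.sort(dim=-1)

    # Gather the retained keys and values.
    gather_idx = keep_idx.unsqueeze(-1).expand(-1, -1, -1, head_dim)
    return keys.gather(2, gather_idx), values.gather(2, gather_idx)
```

Setting `compression_ratio=0.5` corresponds to the 50% reduction the paper reports for language modelling and needle-in-a-haystack tasks, and `0.9` to the 90% reduction reported for passkey retrieval; in practice the eviction would be applied per layer during decoding.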

