CommVQ: Commutative Vector Quantization for KV Cache Compression
June 23, 2025
Authors: Junyan Li, Yang Zhang, Muhammad Yusuf Hassan, Talha Chafekar, Tianle Cai, Zhile Ren, Pengsheng Guo, Foroozan Karimzadeh, Colorado Reed, Chong Wang, Chuang Gan
cs.AI
Abstract
Large Language Models (LLMs) are increasingly used in applications requiring
long context lengths, but the key-value (KV) cache often becomes a memory
bottleneck on GPUs as context grows. To address this, we propose Commutative
Vector Quantization (CommVQ) to significantly reduce memory usage for
long-context LLM inference. We first introduce additive quantization with a
lightweight encoder and codebook to compress the KV cache, which can be decoded
via simple matrix multiplication. To further reduce computational costs during
decoding, we design the codebook to be commutative with Rotary Position
Embedding (RoPE) and train it using an Expectation-Maximization (EM) algorithm.
This enables efficient integration of decoding into the self-attention
mechanism. Our approach achieves high accuracy with additive quantization and
low overhead via the RoPE-commutative codebook. Experiments on long-context
benchmarks and GSM8K show that our method reduces FP16 KV cache size by 87.5%
with 2-bit quantization, while outperforming state-of-the-art KV cache
quantization methods. Notably, it enables 1-bit KV cache quantization with
minimal accuracy loss, allowing a LLaMA-3.1 8B model to run with a 128K context
length on a single RTX 4090 GPU. The source code is available at:
https://github.com/UMass-Embodied-AGI/CommVQ.
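The two key ideas in the abstract can be illustrated with a small NumPy sketch. This is a hypothetical toy, not the paper's actual codebook parameterization or training procedure: the sizes `m`, `n`, `d` and the helper names are made up for illustration. It shows (1) that additive-quantization decoding reduces to one matrix multiplication, and (2) the linear-algebra fact behind RoPE commutativity: RoPE is a block-diagonal matrix of 2x2 rotations, and any block-diagonal matrix whose 2x2 blocks have the complex-number form aI + bJ commutes with it.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8         # head dimension (even, since RoPE rotates coordinate pairs)
m, n = 4, 16  # hypothetical sizes: m additive codebooks, n codewords each

# Additive quantization: a key is approximated by the sum of one codeword
# per codebook, so decoding is a single matrix multiplication of a
# concatenated one-hot code with the stacked codebooks.
codebooks = rng.normal(size=(m * n, d))
idx = rng.integers(0, n, size=m)    # chosen codeword index per codebook
code = np.zeros(m * n)
code[np.arange(m) * n + idx] = 1.0  # concatenated one-hot code vector
k_hat = code @ codebooks            # decoded key, shape (d,)

def rope_matrix(pos, d, base=10000.0):
    """RoPE at position `pos` as a block-diagonal matrix of 2x2 rotations."""
    R = np.zeros((d, d))
    for i in range(d // 2):
        theta = pos * base ** (-2.0 * i / d)
        c, s = np.cos(theta), np.sin(theta)
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, -s], [s, c]]
    return R

def commutative_map(a, b, d):
    """Block-diagonal matrix whose 2x2 blocks are a*I + b*J, with
    J = [[0, -1], [1, 0]]. Each such block commutes with any 2x2
    rotation, so the whole matrix commutes with the RoPE matrix."""
    M = np.zeros((d, d))
    for i in range(d // 2):
        M[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[a[i], -b[i]], [b[i], a[i]]]
    return M

R = rope_matrix(pos=7, d=d)
C = commutative_map(rng.normal(size=d // 2), rng.normal(size=d // 2), d)

# Commutativity lets the position rotation be applied on either side of the
# codebook map, so decoding can be folded into the attention computation.
print(np.allclose(R @ C, C @ R))  # True
```

A generic dense matrix would not commute with `R`; restricting each 2x2 block to the form aI + bJ (the matrix representation of a complex number) is what makes the order of rotation and decoding interchangeable.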