CommVQ: Commutative Vector Quantization for KV Cache Compression

Abstract

Large Language Models (LLMs) are increasingly used in applications requiring long context lengths, but the key-value (KV) cache often becomes a memory bottleneck on GPUs as context grows. To address this, we propose Commutative Vector Quantization (CommVQ) to significantly reduce memory usage for long-context LLM inference. We first introduce additive quantization with a lightweight encoder and codebook to compress the KV cache, which can be decoded via simple matrix multiplication. To further reduce computational costs during decoding, we design the codebook to be commutative with Rotary Position Embedding (RoPE) and train it using an Expectation-Maximization (EM) algorithm. This enables efficient integration of decoding into the self-attention mechanism. Our approach achieves high accuracy with additive quantization and low overhead via the RoPE-commutative codebook. Experiments on long-context benchmarks and GSM8K show that our method reduces FP16 KV cache size by 87.5% with 2-bit quantization, while outperforming state-of-the-art KV cache quantization methods. Notably, it enables 1-bit KV cache quantization with minimal accuracy loss, allowing a LLaMA-3.1 8B model to run with a 128K context length on a single RTX 4090 GPU. The source code is available at: https://github.com/UMass-Embodied-AGI/CommVQ.
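To make the commutativity idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes that RoPE acts on each pair of channels as a 2x2 rotation, and that a codebook block is restricted to the scaled-rotation form [[x, -y], [y, x]], which is one family of matrices known to commute with every 2D rotation. Whether this matches the exact parameterization in CommVQ is an assumption. The helper names (rotation, commuting_block, kv_cache_reduction) are hypothetical. The last function reproduces the size arithmetic behind the 87.5% figure, ignoring any codebook or metadata overhead.

```python
import math

def rotation(theta):
    """2x2 rotation matrix, the per-channel-pair building block of RoPE."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def commuting_block(x, y):
    """Hypothetical codebook block of the form [[x, -y], [y, x]] (assumed)."""
    return [[x, -y], [y, x]]

def matmul(a, b):
    """Plain 2x2 matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def max_abs_diff(a, b):
    """Largest entrywise absolute difference between two 2x2 matrices."""
    return max(abs(a[i][j] - b[i][j]) for i in range(2) for j in range(2))

def kv_cache_reduction(bits, baseline_bits=16):
    """Fraction of memory saved relative to an FP16 cache (overhead ignored)."""
    return 1.0 - bits / baseline_bits

R = rotation(0.7)                 # RoPE rotation at some position/frequency
C = commuting_block(1.3, -0.4)    # structured block: commutes with R
G = [[1.0, 2.0], [3.0, 4.0]]      # unstructured block: does not commute

commute_gap = max_abs_diff(matmul(R, C), matmul(C, R))
generic_gap = max_abs_diff(matmul(R, G), matmul(G, R))

print("structured block commutes:", commute_gap < 1e-12)
print("generic block commutes:", generic_gap < 1e-12)
print("2-bit reduction vs FP16:", kv_cache_reduction(2))
```

Because the structured block satisfies R·C = C·R, the rotation can in principle be applied before or after decoding with the same result, which is the property the abstract relies on to fold decoding into self-attention.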
