OScaR: Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond

Abstract

The rapid advancement toward long-context reasoning and multi-modal intelligence has made the memory footprint of the Key-Value (KV) cache a dominant bottleneck for efficient deployment. While the established per-channel quantization effectively accommodates intrinsic channel-wise outliers in Key tensors, its efficacy diminishes under extreme compression. In this work, we revisit the inherent limitations of the per-channel quantization paradigm from both empirical and theoretical perspectives. Our analysis identifies Token Norm Imbalance (TNI) as the primary bottleneck to quantization fidelity. We demonstrate that TNI systematically amplifies errors when shared quantization parameters must span token groups with substantial norm disparities. Instead of relying on intricate quantization pipelines (e.g., TurboQuant), we propose OScaR (Omni-Scaled Canalized Rotation), an accurate and lightweight KV cache compression framework for X-LLMs (i.e., text-only, multi-modal, and omni-modal LLMs). Building on the per-channel paradigm, OScaR applies Canalized Rotation followed by Omni-Token Scaling to mitigate TNI-induced variance along the sequence dimension both effectively and efficiently, supported by our optimized system design and CUDA kernels. Extensive evaluations across X-LLMs show that OScaR consistently outperforms existing methods and achieves near-lossless performance under INT2 quantization, establishing it as a robust, low-complexity, and universal framework that defines a new Pareto front. Compared with the BF16 FlashDecoding-v2 baseline, our OScaR implementation speeds up decoding by up to 3.0x, reduces the memory footprint by 5.3x, and increases throughput by 4.1x. The code for OScaR is publicly available at https://github.com/ZunhaiSu/OScaR-KV-Quant.
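
To make the Token Norm Imbalance argument concrete, the following is a minimal NumPy sketch, not the OScaR implementation. It quantizes a synthetic Key tensor to 2 bits with per-channel parameters shared across groups of tokens, first directly and then after normalizing each token by its L2 norm. The tensor shape, the group size of 32, the 20x norm disparity, and the helper names `quantize_per_channel` and `mean_token_rel_error` are all illustrative assumptions. The rotation step is omitted because the abstract does not specify it, and plain per-token L2 normalization stands in for Omni-Token Scaling.

```python
import numpy as np

def quantize_per_channel(x, bits=2, group=32):
    """Asymmetric uniform quantization (quantize, then dequantize).

    Within each group of `group` consecutive tokens, every channel gets its
    own (scale, zero-point) pair that is shared by all tokens in the group.
    Assumes the number of tokens is divisible by `group`.
    """
    levels = 2 ** bits - 1
    tokens, dim = x.shape
    g = x.reshape(tokens // group, group, dim)
    lo = g.min(axis=1, keepdims=True)           # per-group, per-channel minimum
    hi = g.max(axis=1, keepdims=True)           # per-group, per-channel maximum
    scale = np.maximum(hi - lo, 1e-8) / levels  # guard against a zero range
    codes = np.clip(np.round((g - lo) / scale), 0, levels)
    return (codes * scale + lo).reshape(tokens, dim)

def mean_token_rel_error(x, x_hat):
    """Average over tokens of ||x_t - x_hat_t|| / ||x_t||."""
    err = np.linalg.norm(x - x_hat, axis=1)
    return float(np.mean(err / np.linalg.norm(x, axis=1)))

rng = np.random.default_rng(0)
tokens, dim = 128, 64
k = rng.standard_normal((tokens, dim))
k[::16] *= 20.0  # synthetic imbalance: every 16th token has a 20x larger norm

# Baseline: shared per-channel parameters must cover small and large tokens.
k_hat_plain = quantize_per_channel(k)
err_plain = mean_token_rel_error(k, k_hat_plain)

# Illustrative token scaling: normalize each token, quantize, then rescale.
# This costs one extra stored scalar (the norm) per token.
norms = np.linalg.norm(k, axis=1, keepdims=True)
k_hat_scaled = quantize_per_channel(k / norms) * norms
err_scaled = mean_token_rel_error(k, k_hat_scaled)

print(f"mean per-token relative error, plain per-channel: {err_plain:.3f}")
print(f"mean per-token relative error, with token scaling: {err_scaled:.3f}")
```

In this toy setting the few large-norm tokens stretch each channel's quantization range, so the small-norm tokens that share those parameters collapse onto one or two levels. Equalizing token norms before quantization removes that disparity, which is the effect the abstract attributes to token scaling.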