OScaR: Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond

Abstract

The rapid advance toward long-context reasoning and multi-modal intelligence has made the memory footprint of the Key-Value (KV) cache a dominant bottleneck for efficient deployment. While established per-channel quantization effectively accommodates the intrinsic channel-wise outliers in Key tensors, its efficacy diminishes under extreme compression. In this work, we revisit the inherent limitations of the per-channel quantization paradigm from both empirical and theoretical perspectives. Our analysis identifies Token Norm Imbalance (TNI) as the primary bottleneck to quantization fidelity. We demonstrate that TNI systematically amplifies errors when shared quantization parameters must span token groups with substantial norm disparities. Instead of relying on intricate quantization pipelines (e.g., TurboQuant), we propose OScaR (Omni-Scaled Canalized Rotation), an accurate and lightweight KV cache compression framework for X-LLMs (i.e., text-only, multi-modal, and omni-modal LLMs). Advancing the per-channel paradigm, OScaR applies Canalized Rotation followed by Omni-Token Scaling to mitigate the TNI-induced variance along the sequence dimension both effectively and efficiently, supported by our optimized system design and CUDA kernels. Extensive evaluations across X-LLMs show that OScaR consistently outperforms existing methods and achieves near-lossless performance under INT2 quantization, establishing it as a robust, low-complexity, and universal framework that defines a new Pareto front. Compared with the BF16 FlashDecoding-v2 baseline, our OScaR implementation achieves up to a 3.0x decoding speedup, reduces the memory footprint by 5.3x, and increases throughput by 4.1x. The code for OScaR is publicly available at https://github.com/ZunhaiSu/OScaR-KV-Quant.
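
To make the TNI effect concrete, the following toy sketch quantizes a synthetic Key group to 2 bits with one (min, scale) pair per channel, first directly and then after dividing each token by its L2 norm. It is an illustration under stated assumptions, not the OScaR implementation: the Gaussian data, the single high-norm token, the group size, and all function names are hypothetical, the per-token L2 normalization is only a simple stand-in for Omni-Token Scaling, and Canalized Rotation is not modeled at all.

```python
import math
import random


def quantize_per_channel(keys, bits=2):
    """Asymmetric uniform quantization with one (min, scale) pair per channel,
    shared by every token in the group. Returns the dequantized keys."""
    levels = (1 << bits) - 1
    num_channels = len(keys[0])
    out = [[0.0] * num_channels for _ in keys]
    for c in range(num_channels):
        column = [row[c] for row in keys]
        lo, hi = min(column), max(column)
        scale = (hi - lo) / levels or 1.0  # guard against a constant channel
        for t, x in enumerate(column):
            q = min(levels, max(0, round((x - lo) / scale)))
            out[t][c] = q * scale + lo
    return out


def l2_norm(row):
    return math.sqrt(sum(x * x for x in row))


def quantize_with_token_scaling(keys, bits=2):
    """Assumed stand-in for token scaling: normalize each token to unit L2 norm,
    quantize per channel, then restore the norms (kept in full precision)."""
    norms = [l2_norm(row) or 1.0 for row in keys]
    normalized = [[x / n for x in row] for row, n in zip(keys, norms)]
    dequantized = quantize_per_channel(normalized, bits)
    return [[x * n for x in row] for row, n in zip(dequantized, norms)]


def mean_relative_error(keys, approx):
    """Average over tokens of ||k - k_hat|| / ||k||."""
    errors = []
    for row, row_hat in zip(keys, approx):
        diff = [a - b for a, b in zip(row, row_hat)]
        errors.append(l2_norm(diff) / l2_norm(row))
    return sum(errors) / len(errors)


def make_keys(num_tokens=32, num_channels=64, outlier_gain=50.0, seed=0):
    """Synthetic Key group: Gaussian tokens, with token 0 scaled up to create
    a large norm imbalance inside the group (hypothetical setup)."""
    rng = random.Random(seed)
    keys = [[rng.gauss(0.0, 1.0) for _ in range(num_channels)]
            for _ in range(num_tokens)]
    keys[0] = [outlier_gain * x for x in keys[0]]
    return keys


keys = make_keys()
err_plain = mean_relative_error(keys, quantize_per_channel(keys))
err_scaled = mean_relative_error(keys, quantize_with_token_scaling(keys))
print(f"per-channel INT2, shared parameters: {err_plain:.3f}")
print(f"per-channel INT2, after token scaling: {err_scaled:.3f}")
```

In this setup the high-norm token stretches each channel's quantization range, so the remaining tokens collapse onto one or two of the four levels. Equalizing token norms before quantization removes that stretch at the cost of storing one extra scalar per token.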