OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV-Cache Quantization

Abstract

INT2 KV-cache quantization is attractive for long-context LLM serving, but it remains difficult to make both accurate and deployable. Simple rotations such as Hadamard transforms reduce outliers, but still degrade at INT2 because they are not aligned with downstream attention. We propose OSCAR, an ultra-low-bit KV-cache quantization method that estimates attention-aware covariance structures offline and uses them to derive fixed rotations and clipping thresholds for quantization. In this way, KV quantization is aligned with the covariance structures that attention actually consumes. Beyond the theoretical justification, we develop a fully deployable OSCAR system with a custom INT2 attention kernel that remains compatible with paged KV-cache serving and fused kernel pipelines, enabling seamless integration into modern LLM serving frameworks such as SGLang and vLLM. We evaluate our method on recent reasoning models with reasoning traces of up to 32K tokens across 5 tasks. On Qwen3-4B-Thinking-2507 and Qwen3-8B, OSCAR reduces the accuracy gap to BF16 to 3.78 and 1.42 points, respectively, whereas INT2 with naive rotation collapses to nearly zero accuracy. We further scale OSCAR to Qwen3-32B and GLM-4.7 (358B parameters), where it remains effectively on par with BF16. On long-context evaluation with RULER-NIAH at up to 128K tokens, OSCAR remains robust on both Qwen3 models, whereas INT2 with naive rotation collapses. System-wise, OSCAR reduces KV-cache memory by approximately 8x, improves throughput by up to 7x at large batch sizes under the same memory budget, and accelerates batch-size-1 decoding by up to 3x over BF16 because of reduced memory-bandwidth overhead.
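The general shape of an offline, covariance-derived rotation with fixed clipping thresholds can be illustrated with a minimal NumPy sketch. This is not the OSCAR algorithm or its kernel: the function names are hypothetical, and the choice of the query second-moment matrix as the "attention-aware" statistic, the per-channel percentile clipping, and the uniform 4-level grid are all assumptions made purely for illustration.

```python
import numpy as np

def fit_rotation_and_clip(keys, queries, clip_pct=99.0):
    """Offline calibration (illustrative assumption, not the paper's estimator).

    keys, queries: arrays of shape (num_tokens, head_dim) from calibration data.
    Returns an orthogonal rotation and a per-channel clipping threshold.
    """
    # Assumed attention-aware statistic: second moment of the queries,
    # since the logit error q^T (k - k_hat) is weighted by E[q q^T].
    cov = queries.T @ queries / len(queries)
    _, eigvecs = np.linalg.eigh(cov)  # columns form an orthonormal basis
    rotated = keys @ eigvecs
    # Fixed per-channel clipping threshold, taken as a percentile of |value|.
    clip = np.percentile(np.abs(rotated), clip_pct, axis=0)
    clip = np.maximum(clip, 1e-8)  # guard against all-zero channels
    return eigvecs, clip

def quantize_int2(x, rotation, clip):
    """Rotate, clip, and map each value to one of 4 uniform levels (codes 0..3)."""
    rotated = np.clip(x @ rotation, -clip, clip)
    scale = 2.0 * clip / 3.0  # 4 levels span [-clip, clip]
    return np.round((rotated + clip) / scale).astype(np.uint8)

def dequantize_int2(codes, rotation, clip):
    """Map codes back to values and undo the rotation."""
    scale = 2.0 * clip / 3.0
    rotated = codes.astype(np.float64) * scale - clip
    return rotated @ rotation.T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    calib_keys = rng.normal(size=(512, 16))
    calib_queries = rng.normal(size=(512, 16))
    rotation, clip = fit_rotation_and_clip(calib_keys, calib_queries)

    new_keys = rng.normal(size=(4, 16))
    codes = quantize_int2(new_keys, rotation, clip)
    approx = dequantize_int2(codes, rotation, clip)
    print(codes.shape, approx.shape)
```

Because the rotation and thresholds are computed once from calibration data and then held fixed, the serving-time work reduces to a matrix multiply, a clip, and a rounding step. A real INT2 cache would additionally pack four 2-bit codes into each byte, which this sketch omits.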