OSCAR: Offline Spectral Covariance-Aware Rotation for 2-Bit KV-Cache Quantization

Abstract

INT2 KV-cache quantization is attractive for long-context LLM serving, but making it both accurate and deployable remains difficult. Simple rotations such as Hadamard transforms reduce outliers, yet they still degrade accuracy at INT2 because they are not aligned with downstream attention. We propose OSCAR, an ultra-low-bit KV-cache quantization method that estimates attention-aware covariance structures offline and uses them to derive fixed rotations and clipping thresholds for quantization. In this way, it aligns KV quantization with the covariance structures that attention actually consumes. Beyond the theoretical justification, we develop a fully deployable OSCAR system with a custom INT2 attention kernel that remains compatible with paged KV-cache serving and fused kernel pipelines, enabling seamless integration into modern LLM serving frameworks such as SGLang and vLLM. We evaluate our method on recent reasoning models with reasoning traces of up to 32K tokens across 5 tasks. On Qwen3-4B-Thinking-2507 and Qwen3-8B, OSCAR narrows the accuracy gap to BF16 to 3.78 and 1.42 points, respectively, while naive rotation-based INT2 collapses to nearly zero accuracy. We further scale OSCAR to Qwen3-32B and GLM-4.7 (358B parameters), where it remains effectively on par with BF16. On long-context evaluation (RULER-NIAH up to 128K tokens), OSCAR remains robust on both Qwen3 models, while naive rotation-based INT2 collapses. System-wise, OSCAR reduces KV-cache memory by approximately 8x, improves throughput by up to 7x at large batch sizes under the same memory budget, and accelerates batch-size-1 decoding by up to 3x over BF16 due to reduced memory bandwidth overhead.
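To make the baseline concrete, the following is a minimal sketch of the simple-rotation approach the abstract contrasts against: an orthonormal Hadamard rotation followed by symmetric 2-bit quantization with a clipping threshold, then packing four codes per byte. It is not OSCAR's covariance-derived rotation, whose construction is not given in this abstract. All function names, the max-abs clipping rule, and the toy vector are assumptions chosen for illustration.

```python
import math

def hadamard_rotate(x):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two).
    The normalized transform is its own inverse."""
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    y = list(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]

def quantize_int2(x, clip):
    """Clamp to [-clip, clip] and map to one of four uniform levels (codes 0..3)."""
    assert clip > 0
    step = 2.0 * clip / 3.0
    return [int(round((max(-clip, min(clip, v)) + clip) / step)) for v in x]

def dequantize_int2(codes, clip):
    """Map codes 0..3 back to the four levels in [-clip, clip]."""
    step = 2.0 * clip / 3.0
    return [c * step - clip for c in codes]

def pack_int2(codes):
    """Pack four 2-bit codes into each byte."""
    assert len(codes) % 4 == 0
    out = bytearray()
    for i in range(0, len(codes), 4):
        out.append(codes[i] | (codes[i + 1] << 2)
                   | (codes[i + 2] << 4) | (codes[i + 3] << 6))
    return bytes(out)

def roundtrip_error(x, rotate):
    """Squared reconstruction error after INT2 quantization, with or without rotation.
    The clipping threshold here is simply the max absolute value (an assumption)."""
    z = hadamard_rotate(x) if rotate else list(x)
    clip = max(abs(v) for v in z)
    z_hat = dequantize_int2(quantize_int2(z, clip), clip)
    x_hat = hadamard_rotate(z_hat) if rotate else z_hat
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

if __name__ == "__main__":
    # Toy vector with one large outlier channel (made-up values).
    x = [8.0, 0.5, -0.5, 0.25, -0.25, 0.5, -0.5, 0.25]
    print("error without rotation:", roundtrip_error(x, rotate=False))
    print("error with rotation:   ", roundtrip_error(x, rotate=True))
    rotated = hadamard_rotate(x)
    packed = pack_int2(quantize_int2(rotated, max(abs(v) for v in rotated)))
    bf16_bytes = 2 * len(x)  # 16 bits per value
    print("payload compression:", bf16_bytes / len(packed))
```

On the payload alone, 16-bit values packed into 2 bits give exactly 8x; any per-group scale or threshold metadata would lower that slightly, which is presumably why the abstract says "approximately 8x". The rotation spreads the outlier across all channels, so the four quantization levels cover a much narrower range.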