OScaR: Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond

Abstract

The rapid advancement toward long-context reasoning and multi-modal intelligence has made the memory footprint of the Key-Value (KV) cache a dominant bottleneck for efficient deployment. While established per-channel quantization effectively accommodates intrinsic channel-wise outliers in Key tensors, its efficacy diminishes under extreme compression. In this work, we revisit the inherent limitations of the per-channel quantization paradigm from both empirical and theoretical perspectives. Our analysis identifies Token Norm Imbalance (TNI) as the primary bottleneck to quantization fidelity. We demonstrate that TNI systematically amplifies errors when shared quantization parameters must span token groups with substantial norm disparities. Instead of relying on intricate quantization pipelines (e.g., TurboQuant), we propose OScaR (Omni-Scaled Canalized Rotation), an accurate and lightweight KV cache compression framework for X-LLMs (i.e., text-only, multi-modal, and omni-modal LLMs). Advancing the per-channel paradigm, OScaR employs Canalized Rotation followed by Omni-Token Scaling to mitigate TNI-induced variance along the sequence dimension both effectively and efficiently, further supported by our optimized system design and CUDA kernels. Extensive evaluations across X-LLMs show that OScaR consistently outperforms existing methods and achieves near-lossless performance under INT2 quantization, establishing it as a robust, low-complexity, and universal framework that defines a new Pareto front. Compared with the BF16 FlashDecoding-v2 baseline, our OScaR implementation achieves up to a 3.0x speedup in decoding, reduces the memory footprint by 5.3x, and increases throughput by 4.1x. The code for OScaR is publicly available at https://github.com/ZunhaiSu/OScaR-KV-Quant.
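
The TNI effect described above can be illustrated with a small toy experiment. The sketch below is not the OScaR implementation: it omits Canalized Rotation entirely and uses plain per-token L2 normalization as a stand-in for token scaling. The data are synthetic, the 50x norm ratio is an assumption chosen for illustration, and the helper names `quantize_per_channel` and `mean_relative_error` are hypothetical. It compares INT2 per-channel quantization of a Key group with and without rescaling each token to unit norm first.

```python
import math
import random


def quantize_per_channel(rows, bits=2):
    """Asymmetric uniform quantization with one (offset, step) pair per channel,
    shared by every token (row) in the group. Returns dequantized values."""
    levels = (1 << bits) - 1
    n_channels = len(rows[0])
    out = [[0.0] * n_channels for _ in rows]
    for c in range(n_channels):
        column = [row[c] for row in rows]
        lo, hi = min(column), max(column)
        step = (hi - lo) / levels or 1.0  # guard against a constant channel
        for t, value in enumerate(column):
            code = round((value - lo) / step)  # integer code in [0, levels]
            out[t][c] = lo + code * step
    return out


def mean_relative_error(original, restored):
    """Average over tokens of ||x - x_hat||^2 / ||x||^2."""
    total = 0.0
    for x, x_hat in zip(original, restored):
        err = sum((a - b) ** 2 for a, b in zip(x, x_hat))
        total += err / sum(a * a for a in x)
    return total / len(original)


random.seed(0)
n_tokens, n_channels = 64, 32
# Synthetic Key group (assumption): every 16th token has a ~50x larger norm.
keys = [
    [random.gauss(0.0, 50.0 if t % 16 == 0 else 1.0) for _ in range(n_channels)]
    for t in range(n_tokens)
]

# (a) Plain per-channel INT2: the large-norm tokens set each channel's range,
# so the small-norm tokens collapse onto one or two quantization levels.
plain = quantize_per_channel(keys)

# (b) Per-token scaling first: normalize each token to unit norm, quantize,
# then multiply the stored full-precision norm back in.
norms = [math.sqrt(sum(v * v for v in row)) for row in keys]
unit = [[v / n for v in row] for row, n in zip(keys, norms)]
scaled = [[v * n for v in row] for row, n in zip(quantize_per_channel(unit), norms)]

err_plain = mean_relative_error(keys, plain)
err_scaled = mean_relative_error(keys, scaled)
print(f"plain per-channel INT2 : {err_plain:.3f}")
print(f"with per-token scaling : {err_scaled:.3f}")
```

In this toy setting the per-token relative error is markedly lower once norms are equalized, because the shared per-channel range no longer has to cover tokens of very different magnitude. The cost is one extra full-precision scalar per token, which a real system would need to account for in its memory budget.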