OScaR: Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond

Abstract

The rapid advance toward long-context reasoning and multi-modal intelligence has made the Key-Value (KV) cache a dominant memory bottleneck for efficient deployment. The established per-channel quantization effectively accommodates the intrinsic channel-wise outliers in Key tensors, but its efficacy diminishes under extreme compression. In this work, we revisit the inherent limitations of the per-channel quantization paradigm from both empirical and theoretical perspectives. Our analysis identifies Token Norm Imbalance (TNI) as the primary bottleneck to quantization fidelity. We demonstrate that TNI systematically amplifies errors when shared quantization parameters must span token groups with substantial norm disparities. Instead of relying on intricate quantization pipelines (e.g., TurboQuant), we propose OScaR (Omni-Scaled Canalized Rotation), an accurate and lightweight KV cache compression framework for X-LLMs (i.e., text-only, multi-modal, and omni-modal LLMs). Advancing the per-channel paradigm, OScaR applies Canalized Rotation followed by Omni-Token Scaling to mitigate TNI-induced variance along the sequence dimension both effectively and efficiently, and is further supported by our optimized system design and CUDA kernels. Extensive evaluations across X-LLMs show that OScaR consistently outperforms existing methods and achieves near-lossless performance under INT2 quantization, establishing it as a robust, low-complexity, and universal framework that defines a new Pareto front. Compared with the BF16 FlashDecoding-v2 baseline, our OScaR implementation achieves up to a 3.0x speedup in decoding, reduces the memory footprint by 5.3x, and increases throughput by 4.1x. The code for OScaR is publicly available at https://github.com/ZunhaiSu/OScaR-KV-Quant.
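
To make the TNI argument concrete, the following toy sketch compares plain per-channel INT2 quantization with the same quantizer applied after per-token norm scaling. It is an illustration under stated assumptions, not the OScaR implementation: the Canalized Rotation step is omitted, dividing each token by its L2 norm is only an assumed stand-in for Omni-Token Scaling, and all function names, the group size (32 tokens, 16 channels), and the 20x norm disparity are hypothetical choices made for this example.

```python
import math
import random


def l2(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def quantize_per_channel(rows, bits=2):
    """Asymmetric quantize-dequantize with one (min, scale) pair per channel,
    shared by every token in the group (rows = tokens, columns = channels)."""
    levels = (1 << bits) - 1
    n_tokens, n_channels = len(rows), len(rows[0])
    out = [[0.0] * n_channels for _ in range(n_tokens)]
    for c in range(n_channels):
        col = [rows[t][c] for t in range(n_tokens)]
        lo, hi = min(col), max(col)
        scale = (hi - lo) / levels or 1.0  # guard against a constant channel
        for t in range(n_tokens):
            q = min(levels, max(0, round((col[t] - lo) / scale)))
            out[t][c] = q * scale + lo
    return out


def quantize_with_token_scaling(rows, bits=2):
    """ASSUMPTION: normalize each token to unit L2 norm before per-channel
    quantization, then restore the norm. The norms are kept in full precision."""
    norms = [l2(r) or 1.0 for r in rows]
    unit = [[x / n for x in r] for r, n in zip(rows, norms)]
    deq = quantize_per_channel(unit, bits)
    return [[x * n for x in r] for r, n in zip(deq, norms)]


def mean_relative_error(ref, approx):
    """Average over tokens of ||x - x_hat|| / ||x||."""
    errs = [l2([a - b for a, b in zip(r, p)]) / l2(r) for r, p in zip(ref, approx)]
    return sum(errs) / len(errs)


random.seed(0)
# Hypothetical Key group: 32 tokens x 16 channels of unit-variance noise.
keys = [[random.gauss(0.0, 1.0) for _ in range(16)] for _ in range(32)]
# Inject token norm imbalance: one token is 20x larger than the rest.
keys[0] = [20.0 * x for x in keys[0]]

err_plain = mean_relative_error(keys, quantize_per_channel(keys))
err_scaled = mean_relative_error(keys, quantize_with_token_scaling(keys))
print(f"per-channel INT2:           {err_plain:.3f}")
print(f"token-scaled + per-channel: {err_scaled:.3f}")
```

In this sketch the single high-norm token stretches the shared per-channel range, so the remaining tokens collapse onto one or two of the four INT2 levels. Equalizing token norms first keeps the shared range tight, at the cost of storing one extra scalar per token.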