OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond

May 19, 2026
Authors: Zunhai Su, Rui Yang, Chao Zhang, Yaxiu Liu, Yifan Zhang, Wei Wu, Jing Xiong, Dayou Du, Xialie Zhuang, Yulei Qian, Yuchen Xie, Yik-Chung Wu, Hongxia Yang, Ngai Wong
cs.AI

Abstract

Rapid progress toward long-context reasoning and multi-modal intelligence has made the Key-Value (KV) cache a dominant memory bottleneck for efficient deployment. Established per-channel quantization effectively accommodates the intrinsic channel-wise outliers in Key tensors, but its efficacy diminishes under extreme compression. In this work, we revisit the inherent limitations of the per-channel quantization paradigm from both empirical and theoretical perspectives. Our analysis identifies Token Norm Imbalance (TNI) as the primary bottleneck to quantization fidelity. We demonstrate that TNI systematically amplifies errors when shared quantization parameters must span token groups with substantial norm disparities. Instead of relying on intricate quantization pipelines (e.g., TurboQuant), we propose OScaR (Omni-Scaled Canalized Rotation), an accurate and lightweight KV cache compression framework for X-LLMs (i.e., text-only, multi-modal, and omni-modal LLMs). Building on the per-channel paradigm, OScaR applies Canalized Rotation followed by Omni-Token Scaling to mitigate TNI-induced variance along the sequence dimension both effectively and efficiently, supported by an optimized system design and CUDA kernels. Extensive evaluations across X-LLMs show that OScaR consistently outperforms existing methods and achieves near-lossless performance under INT2 quantization, establishing it as a robust, low-complexity, and universal framework that defines a new Pareto front. Compared with the BF16 FlashDecoding-v2 baseline, our OScaR implementation achieves up to a 3.0x decoding speedup, reduces memory footprint by 5.3x, and increases throughput by 4.1x. The code for OScaR is publicly available at https://github.com/ZunhaiSu/OScaR-KV-Quant.
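
To make the Token Norm Imbalance argument concrete, the following is a minimal, dependency-free sketch and not the OScaR implementation. It builds a synthetic Key matrix in which a few tokens have much larger norms, then quantizes it to 2 bits with quantization parameters shared per channel, with and without first dividing each token by its L2 norm. The synthetic data, the helper names (`quantize_per_channel`, `quantize_with_token_scaling`, `mse`), and the choice of plain L2 normalization are all assumptions made for illustration; the abstract does not specify how Omni-Token Scaling is computed, and Canalized Rotation is not modeled here at all.

```python
import math
import random


def quantize_per_channel(keys, bits=2):
    """Quantize then dequantize `keys` (a list of token rows) using one
    asymmetric (min, step) pair per channel, shared by all tokens."""
    levels = 2 ** bits - 1
    out = [[0.0] * len(keys[0]) for _ in keys]
    for c in range(len(keys[0])):
        column = [row[c] for row in keys]
        lo, hi = min(column), max(column)
        step = max(hi - lo, 1e-8) / levels
        for t, x in enumerate(column):
            q = min(max(round((x - lo) / step), 0), levels)
            out[t][c] = lo + q * step
    return out


def quantize_with_token_scaling(keys, bits=2):
    """Hypothetical helper (an assumption, not the paper's method): divide each
    token by its L2 norm, quantize per channel, then multiply the norm back.
    The per-token norms are kept in full precision."""
    norms = [max(math.sqrt(sum(x * x for x in row)), 1e-8) for row in keys]
    unit = [[x / n for x in row] for row, n in zip(keys, norms)]
    deq = quantize_per_channel(unit, bits)
    return [[x * n for x in row] for row, n in zip(deq, norms)]


def mse(a, b):
    """Mean squared error between two equally shaped lists of rows."""
    total = sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return total / (len(a) * len(a[0]))


random.seed(0)
tokens, channels = 128, 64
keys = [[random.gauss(0.0, 1.0) for _ in range(channels)] for _ in range(tokens)]
for row in keys[:4]:  # synthetic norm imbalance: 4 tokens are 20x larger
    for c in range(channels):
        row[c] *= 20.0

err_plain = mse(keys, quantize_per_channel(keys))
err_scaled = mse(keys, quantize_with_token_scaling(keys))
print(f"per-channel INT2 MSE:            {err_plain:.3f}")
print(f"with per-token norm scaling MSE: {err_scaled:.3f}")
```

In this toy setting the few high-norm tokens stretch every channel's quantization range, so the many low-norm tokens collapse onto one or two levels. Normalizing token norms first lets the shared range fit all tokens, at the cost of storing one extra scalar per token.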