OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond
May 19, 2026
Authors: Zunhai Su, Rui Yang, Chao Zhang, Yaxiu Liu, Yifan Zhang, Wei Wu, Jing Xiong, Dayou Du, Xialie Zhuang, Yulei Qian, Yuchen Xie, Yik-Chung Wu, Hongxia Yang, Ngai Wong
cs.AI
Abstract
The rapid advancement toward long-context reasoning and multi-modal intelligence has made the memory footprint of the Key-Value (KV) cache a dominant bottleneck for efficient deployment. While the established per-channel quantization effectively accommodates intrinsic channel-wise outliers in Key tensors, its efficacy diminishes under extreme compression. In this work, we revisit the inherent limitations of the per-channel quantization paradigm from both empirical and theoretical perspectives. Our analysis identifies Token Norm Imbalance (TNI) as the primary bottleneck to quantization fidelity. We demonstrate that TNI systematically amplifies errors when shared quantization parameters must span token groups with substantial norm disparities. Instead of relying on intricate quantization pipelines (e.g., TurboQuant), we propose OScaR (Omni-Scaled Canalized Rotation), an accurate and lightweight KV cache compression framework for X-LLMs (i.e., text-only, multi-modal, and omni-modal LLMs). Advancing the per-channel paradigm, OScaR applies Canalized Rotation followed by Omni-Token Scaling to mitigate TNI-induced variance along the sequence dimension both effectively and efficiently, supported by our optimized system design and CUDA kernels. Extensive evaluations across X-LLMs show that OScaR consistently outperforms existing methods and achieves near-lossless performance under INT2 quantization, establishing it as a robust, low-complexity, and universal framework that defines a new Pareto front. Compared with the BF16 FlashDecoding-v2 baseline, our OScaR implementation achieves up to a 3.0x decoding speedup, reduces memory footprint by 5.3x, and increases throughput by 4.1x. The code for OScaR is publicly available at https://github.com/ZunhaiSu/OScaR-KV-Quant.
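To make the TNI argument concrete, the following is a minimal sketch and not the OScaR implementation. It assumes a toy Key matrix with Gaussian entries, one token whose norm is inflated 50x, plain asymmetric INT2 per-channel quantization, and ordinary per-token L2 normalization as a stand-in for token scaling. The rotation step and the exact definition of Omni-Token Scaling are not reproduced here, and all function names are hypothetical.

```python
import math
import random


def quantize_per_channel(K, bits=2):
    """Asymmetric uniform quantization with one (min, step) pair per channel.

    K is a list of token rows; each column (channel) shares its parameters
    across all tokens. Returns the dequantized matrix.
    """
    levels = (1 << bits) - 1
    n_tokens, n_channels = len(K), len(K[0])
    out = [[0.0] * n_channels for _ in range(n_tokens)]
    for c in range(n_channels):
        col = [K[t][c] for t in range(n_tokens)]
        lo, hi = min(col), max(col)
        step = (hi - lo) / levels or 1.0  # fall back to 1.0 on a flat channel
        for t in range(n_tokens):
            q = round((col[t] - lo) / step)  # integer code in [0, levels]
            out[t][c] = lo + q * step        # dequantize
    return out


def quantize_with_token_scaling(K, bits=2):
    """Illustrative stand-in: normalize each token to unit L2 norm, quantize
    per channel, then restore the norms (kept in full precision)."""
    norms = [math.sqrt(sum(x * x for x in row)) or 1.0 for row in K]
    normalized = [[x / n for x in row] for row, n in zip(K, norms)]
    deq = quantize_per_channel(normalized, bits)
    return [[x * n for x in row] for row, n in zip(deq, norms)]


def mean_relative_error(K, K_hat):
    """Average over tokens of ||k - k_hat|| / ||k||."""
    total = 0.0
    for row, row_hat in zip(K, K_hat):
        err = math.sqrt(sum((a - b) ** 2 for a, b in zip(row, row_hat)))
        total += err / math.sqrt(sum(a * a for a in row))
    return total / len(K)


random.seed(0)
n_tokens, n_channels = 64, 16
K = [[random.gauss(0.0, 1.0) for _ in range(n_channels)] for _ in range(n_tokens)]
K[0] = [50.0 * x for x in K[0]]  # assumed outlier token with a much larger norm

err_plain = mean_relative_error(K, quantize_per_channel(K))
err_scaled = mean_relative_error(K, quantize_with_token_scaling(K))
print(f"per-channel INT2:            {err_plain:.3f}")
print(f"token scaling + per-channel: {err_scaled:.3f}")
```

In this toy setting the single high-norm token stretches every channel's quantization range, so the remaining tokens collapse onto one or two of the four INT2 levels. Equalizing token norms before the shared per-channel parameters are computed removes that stretch. The metric is the per-token relative error; the sketch makes no claim about the accuracy or speed figures reported above.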