OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

May 18, 2026
Authors: Zhongzhu Zhou, Donglin Zhuang, Jisen Li, Ziyan Chen, Shuaiwen Leon Song, Ben Athiwaratkun, Xiaoxia Wu
cs.AI

Abstract

INT2 KV-cache quantization is attractive for long-context LLM serving, but making it both accurate and deployable remains difficult. Simple rotations such as Hadamard transforms reduce outliers, yet accuracy still degrades at INT2 because they are not aligned with downstream attention. We propose OSCAR, an ultra-low-bit KV-cache quantization method that estimates attention-aware covariance structures offline and uses them to derive fixed rotations and clipping thresholds for quantization. In this way, it aligns KV quantization with the covariance structures that attention actually consumes. Beyond providing theoretical justification, we develop a fully deployable OSCAR system with a custom INT2 attention kernel that remains compatible with paged KV-cache serving and fused kernel pipelines, enabling seamless integration into modern LLM serving frameworks such as SGLang and vLLM. We evaluate our method on recent reasoning models with reasoning traces of up to 32K tokens across 5 tasks. On Qwen3-4B-Thinking-2507 and Qwen3-8B, OSCAR narrows the accuracy gap to BF16 to 3.78 and 1.42 points, respectively, while naive-rotation INT2 collapses to nearly zero accuracy. We further scale OSCAR to Qwen3-32B and GLM-4.7 (358B parameters), where it remains effectively on par with BF16. On the long-context RULER-NIAH task at context lengths up to 128K, OSCAR remains robust on both Qwen3 models, while naive-rotation INT2 collapses. System-wise, OSCAR reduces KV-cache memory by approximately 8x, improves throughput by up to 7x at large batch sizes under the same memory budget, and accelerates batch-size-1 decoding by up to 3x over BF16 due to reduced memory bandwidth overhead.
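The general idea of deriving a fixed rotation and clipping thresholds offline, then quantizing to 2 bits at serving time, can be illustrated with a minimal NumPy sketch. This is not the authors' algorithm: the abstract does not say how OSCAR computes its rotation or thresholds. The sketch assumes, purely for illustration, that the rotation is the eigenbasis of a second-moment matrix estimated from calibration vectors and that the clip thresholds are per-channel percentiles. The function names (`fit_rotation_and_clip`, `quantize_int2`, `dequantize_int2`) and the `clip_pct` parameter are hypothetical.

```python
import numpy as np


def fit_rotation_and_clip(calib, clip_pct=99.0):
    """Offline step (illustrative assumption, not the paper's procedure).

    calib: (n_tokens, head_dim) calibration vectors, e.g. keys of one head.
    Returns a fixed orthogonal rotation and per-channel clip thresholds.
    """
    # Second-moment (covariance-like) matrix of the calibration data.
    cov = calib.T @ calib / calib.shape[0]
    # Eigenvectors of a symmetric matrix form an orthonormal basis (columns).
    _, rotation = np.linalg.eigh(cov)
    rotated = calib @ rotation
    # Per-channel clipping threshold from a percentile of absolute values.
    clip = np.percentile(np.abs(rotated), clip_pct, axis=0)
    return rotation, np.maximum(clip, 1e-8)


def quantize_int2(x, rotation, clip):
    """Online step: rotate, clip, and map each value to one of 4 levels."""
    rotated = np.clip(x @ rotation, -clip, clip)
    step = 2.0 * clip / 3.0  # 4 uniform levels span [-clip, clip]
    return np.rint((rotated + clip) / step).astype(np.uint8)  # codes in 0..3


def dequantize_int2(codes, rotation, clip):
    """Map 2-bit codes back to values and undo the rotation."""
    step = 2.0 * clip / 3.0
    rotated = codes.astype(np.float64) * step - clip
    return rotated @ rotation.T


# Usage on synthetic data with a few high-variance channels (outliers).
rng = np.random.default_rng(0)
scales = np.ones(64)
scales[:4] = 20.0
keys = rng.standard_normal((2048, 64)) * scales
rotation, clip = fit_rotation_and_clip(keys[:1024])
codes = quantize_int2(keys[1024:], rotation, clip)
recon = dequantize_int2(codes, rotation, clip)
rel_err = np.linalg.norm(recon - keys[1024:]) / np.linalg.norm(keys[1024:])
print(f"relative reconstruction error: {rel_err:.3f}")
```

Because the rotation and thresholds are computed once offline, the serving path only pays for a fixed matrix multiply and an elementwise quantization. Each code occupies 2 bits instead of the 16 bits of a BF16 value, which is where a roughly 8x reduction in KV-cache memory comes from, before any overhead for stored thresholds.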