OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
May 18, 2026
Authors: Zhongzhu Zhou, Donglin Zhuang, Jisen Li, Ziyan Chen, Shuaiwen Leon Song, Ben Athiwaratkun, Xiaoxia Wu
cs.AI
Abstract
INT2 KV-cache quantization is attractive for long-context LLM serving, but making it both accurate and deployable remains difficult. Simple rotations such as Hadamard transforms reduce outliers, but still degrade at INT2 because they are not aligned with downstream attention. We propose OSCAR, an ultra-low-bit KV-cache quantization method that estimates attention-aware covariance structures offline and uses them to derive fixed rotations and clipping thresholds for quantization. In this way, it aligns KV quantization with the covariance structures that attention actually consumes. Beyond the theoretical justification, we develop a fully deployable OSCAR system with a custom INT2 attention kernel that remains compatible with paged KV-cache serving and fused kernel pipelines, enabling seamless integration into modern LLM serving frameworks such as SGLang and vLLM.
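The abstract gives no implementation details, so the following is only a minimal sketch of the baseline it contrasts against: a fixed Hadamard rotation followed by a symmetric 4-level (INT2) quantizer with a clipping threshold. The helper names (`hadamard_rotate`, `quantize_int2`, `dequantize_int2`, `roundtrip_mse`), the max-abs choice of clipping threshold, and the toy vector are all assumptions made for illustration. OSCAR itself replaces the fixed rotation with one derived offline from attention-aware covariance and uses calibrated clipping thresholds, neither of which is implemented here.

```python
import math


def hadamard_rotate(x):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two).

    The normalized Sylvester-Hadamard matrix is symmetric and orthogonal,
    so applying this function twice returns the original vector.
    """
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    y = list(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]


def quantize_int2(x, clip):
    """Clip to [-clip, clip] and map each value to one of four codes 0..3."""
    step = 2.0 * clip / 3.0
    return [int(round((max(-clip, min(clip, v)) + clip) / step)) for v in x]


def dequantize_int2(codes, clip):
    """Map codes 0..3 back to the levels -clip, -clip/3, clip/3, clip."""
    step = 2.0 * clip / 3.0
    return [c * step - clip for c in codes]


def roundtrip_mse(x, rotate):
    """Quantize x to INT2 (optionally in the rotated basis) and return the MSE."""
    z = hadamard_rotate(x) if rotate else list(x)
    clip = max(abs(v) for v in z)  # assumed threshold: max absolute value
    z_hat = dequantize_int2(quantize_int2(z, clip), clip)
    x_hat = hadamard_rotate(z_hat) if rotate else z_hat
    return sum((p - q) ** 2 for p, q in zip(x, x_hat)) / len(x)


# Toy vector with a single large outlier (hypothetical data).
x = [0.1, -0.2, 0.15, 8.0, -0.1, 0.05, 0.2, -0.15]
mse_plain = roundtrip_mse(x, rotate=False)
mse_rot = roundtrip_mse(x, rotate=True)
print(f"INT2 MSE without rotation: {mse_plain:.3f}")
print(f"INT2 MSE with Hadamard rotation: {mse_rot:.3f}")
```

On this toy input the rotation spreads the outlier's energy across all coordinates, so the four quantization levels are no longer wasted on a single large value. This shows only the outlier-reduction effect; it says nothing about the attention misalignment that the abstract identifies as the remaining problem at INT2.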
We evaluate our method on recent reasoning models with reasoning traces of up to 32k tokens across 5 tasks. On Qwen3-4B-Thinking-2507 and Qwen3-8B, OSCAR reduces the BF16 accuracy gap to 3.78 and 1.42 points, respectively, while naive-rotation INT2 collapses to nearly zero. We further scale OSCAR to Qwen3-32B and GLM-4.7 (358B params), where it remains effectively on par with BF16. On the long-context RULER-NIAH benchmark, at context lengths up to 128K, OSCAR remains robust on both Qwen3 models, while naive-rotation INT2 collapses. System-wise, OSCAR reduces KV-cache memory by approximately 8x, improves throughput by up to 7x at large batch sizes under the same memory budget, and accelerates batch-size-1 decoding by up to 3x over BF16 due to reduced memory bandwidth overhead.
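The roughly 8x memory figure follows from the bit widths: BF16 stores 16 bits per cached element and INT2 stores 2. The sketch below works through that arithmetic for one sequence. The function name `kv_cache_bytes` and the model shape (36 layers, 8 KV heads, head dimension 128) are hypothetical values chosen for illustration, not figures taken from the paper, and any quantization metadata such as scales is ignored.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits_per_element):
    """Bytes needed to cache keys and values for one sequence."""
    elements = 2 * layers * kv_heads * head_dim * tokens  # factor 2: K and V
    return elements * bits_per_element // 8


# Hypothetical model shape and a 32k-token reasoning trace.
shape = dict(tokens=32 * 1024, layers=36, kv_heads=8, head_dim=128)

bf16_bytes = kv_cache_bytes(bits_per_element=16, **shape)
int2_bytes = kv_cache_bytes(bits_per_element=2, **shape)

gib = 1024 ** 3
print(f"BF16 KV cache: {bf16_bytes / gib:.4f} GiB")
print(f"INT2 KV cache: {int2_bytes / gib:.4f} GiB")
print(f"Reduction: {bf16_bytes / int2_bytes:.1f}x")
```

Under these assumptions the ratio is exactly 16 / 2 = 8. Any per-group metadata a real system stores alongside the 2-bit codes would lower the ratio slightly, which is one plausible reading of the abstract's "approximately".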