RoPE-Aware Bit Allocation for KV-Cache Quantization
June 23, 2026
Authors: Fengfeng Liang, Yuechen Zhang, Jiaya Jia
cs.AI
Abstract
Existing low-bit KV-cache quantizers often treat each cached key as a flat vector. Under rotary position embedding (RoPE), however, a key's contribution to a future attention logit decomposes into a position-dependent sum over two-dimensional frequency blocks. This makes key-cache quantization a block-wise bit-allocation problem: high-energy RoPE blocks are more sensitive to quantization error and should receive more bits. We introduce Block-GTQ, a RoPE-aware bit allocator for key-cache quantization built on TurboQuant-MSE (TQ-MSE). For each layer and KV head, Block-GTQ computes a label-free energy score for each RoPE block and greedily allocates integer bit widths by marginal gain. Under matched K/V bit budgets, Block-GTQ better preserves RoPE query-key logits on a ten-model diagnostic panel, cutting per-layer mean absolute error (MAE) by 32-80% at 2 and 3 b/dim K-only quantization and beating uniform TQ-MSE in all 367 of 367 layer comparisons. These fidelity gains translate to stronger downstream long-context retrieval, understanding, and reasoning. At K2V2 on Llama-3.1-8B-Instruct, Block-GTQ raises the six-task NIAH average from 70.6 to 97.4 and the LongBench-EN average from 36.87 to 53.31. On AIME 2024/2025 with DeepSeek-R1-Distill-Qwen-7B and no fp16 recent-key buffer, Block-GTQ at K3V2 scores 51.7/37.5, close to fp16's 54.2/37.9, whereas uniform TQ-MSE collapses to 0.0/0.0. We further implement a packed-cache serving path. On a single H800 GPU with Qwen2.5-3B-Instruct, packed K3V3 achieves 3.24x KV-cache compression with fp16-comparable quality, runs 1.34x faster than fp16 FlashAttention2 at 128K context, reduces peak memory from 56.31 GB to 19.85 GB, and remains feasible at 256K and 512K context, where fp16 runs out of memory. Code is available at https://github.com/JIA-Lab-research/blockgtq.
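The allocation step described in the abstract (a per-block energy score followed by greedy integer bit assignment by marginal gain) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the released Block-GTQ code. The function names block_energies and greedy_bit_allocation are hypothetical. The energy score is assumed to be the mean squared norm of each 2-D pair over a set of calibration keys. RoPE pairs are assumed to be adjacent dimensions (2i, 2i+1), whereas some implementations pair dimension i with i + d/2. The distortion of a block at b bits is assumed to follow the textbook high-resolution model, energy * 4^(-b), which may differ from the marginal-gain criterion used in the paper.

```python
import heapq
from typing import List, Sequence


def block_energies(keys: Sequence[Sequence[float]]) -> List[float]:
    """Mean squared norm of each adjacent 2-D pair over calibration keys.

    Assumption: RoPE block i covers dimensions (2i, 2i+1).
    """
    n_blocks = len(keys[0]) // 2
    totals = [0.0] * n_blocks
    for k in keys:
        for i in range(n_blocks):
            totals[i] += k[2 * i] ** 2 + k[2 * i + 1] ** 2
    return [t / len(keys) for t in totals]


def greedy_bit_allocation(
    energies: Sequence[float],
    avg_bits: int,
    min_bits: int = 0,
    max_bits: int = 8,
) -> List[int]:
    """Assign integer bits per block so the mean equals avg_bits.

    Assumed distortion model: D_i(b) = energies[i] * 4**(-b).
    Each step gives one more bit to the block whose distortion
    would drop the most (largest marginal gain).
    """
    n = len(energies)
    remaining = (avg_bits - min_bits) * n
    if remaining < 0:
        raise ValueError("avg_bits must be at least min_bits")
    bits = [min_bits] * n

    def gain(i: int, b: int) -> float:
        # Distortion reduction from raising block i from b to b + 1 bits.
        return energies[i] * (4.0 ** -b - 4.0 ** -(b + 1))

    # heapq is a min-heap, so gains are stored negated.
    heap = [(-gain(i, bits[i]), i) for i in range(n) if bits[i] < max_bits]
    heapq.heapify(heap)
    while remaining > 0 and heap:
        _, i = heapq.heappop(heap)
        bits[i] += 1
        remaining -= 1
        if bits[i] < max_bits:
            heapq.heappush(heap, (-gain(i, bits[i]), i))
    return bits


if __name__ == "__main__":
    # Four hypothetical RoPE blocks with decaying energy, 2 b/dim budget.
    print(greedy_bit_allocation([16.0, 4.0, 1.0, 0.25], avg_bits=2))
```

Under this assumed model a higher-energy block never receives fewer bits than a lower-energy one, and equal energies fall back to a uniform allocation, which corresponds to the uniform baseline at the same budget. In practice the allocation would be computed once per layer and KV head, as the abstract states.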