RoPE-Aware Bit Allocation for KV-Cache Quantization
June 23, 2026
Authors: Fengfeng Liang, Yuechen Zhang, Jiaya Jia
cs.AI
Abstract
Existing low-bit KV-cache quantizers often treat each cached key as a flat vector. Under RoPE (rotary position embedding), however, a key's contribution to a future attention logit decomposes into a position-dependent sum over two-dimensional frequency blocks. This makes key-cache quantization a block-wise bit-allocation problem: high-energy RoPE blocks are more sensitive to quantization error and should receive more bits. We introduce Block-GTQ, a RoPE-aware bit allocator for key-cache quantization built on TurboQuant-MSE (TQ-MSE). For each layer and KV head, Block-GTQ computes a label-free energy score for each RoPE block and greedily allocates integer bit widths by marginal gain. Under matched K/V bit budgets, Block-GTQ better preserves RoPE query-key logits on a ten-model diagnostic panel, cutting per-layer MAE by 32-80% at 2 and 3 bits per dimension under K-only quantization and winning all 367 of 367 layer comparisons against uniform TQ-MSE. These fidelity gains translate to stronger downstream long-context retrieval, understanding, and reasoning. At K2V2 on Llama-3.1-8B-Instruct, Block-GTQ raises the six-task NIAH average from 70.6 to 97.4 and the LongBench-EN average from 36.87 to 53.31. On AIME 2024/2025 with DeepSeek-R1-Distill-Qwen-7B, without an fp16 recent-key buffer, Block-GTQ at K3V2 scores 51.7/37.5, close to fp16's 54.2/37.9, whereas uniform TQ-MSE collapses to 0.0/0.0. We further implement a packed-cache serving path. On a single H800 GPU with Qwen2.5-3B-Instruct, packed K3V3 achieves 3.24x KV-cache compression with fp16-comparable quality, runs 1.34x faster than fp16 FlashAttention2 at 128K context, reduces peak memory from 56.31 GB to 19.85 GB, and remains feasible at 256K and 512K context, where fp16 runs out of memory. Code is available at https://github.com/JIA-Lab-research/blockgtq.
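To make the allocation step concrete, the following is a minimal sketch of greedy per-block bit allocation for a single layer and KV head. It is not the authors' implementation. The abstract does not specify the energy score or the marginal-gain rule, so the sketch makes three assumptions: the energy score is the mean squared norm of each two-dimensional block, RoPE pairs adjacent dimensions (2i, 2i+1), and distortion follows the textbook high-rate model D(b) = energy * 4^(-b). The function names `rope_block_energy` and `greedy_bit_allocation` and the `max_bits` cap are hypothetical, and the TQ-MSE quantizer itself is not modeled.

```python
import heapq

import numpy as np


def rope_block_energy(keys: np.ndarray) -> np.ndarray:
    """Mean squared norm of each 2-D block, computed without labels.

    keys: (num_tokens, head_dim) key vectors for one layer and KV head.
    ASSUMPTION: RoPE pairs adjacent dimensions (2i, 2i+1); implementations
    that pair (i, i + head_dim // 2) would need a different regrouping.
    """
    num_tokens, head_dim = keys.shape
    blocks = keys.reshape(num_tokens, head_dim // 2, 2)
    return (blocks ** 2).sum(axis=-1).mean(axis=0)


def greedy_bit_allocation(energy, avg_bits: float, max_bits: int = 8) -> np.ndarray:
    """Assign integer bit widths to blocks, one bit at a time.

    ASSUMPTION: distortion of a block at b bits is energy * 4**(-b), so the
    gain from b -> b + 1 is 0.75 * energy * 4**(-b). The paper's actual gain
    rule may differ.
    """
    num_blocks = len(energy)
    budget = int(round(avg_bits * num_blocks))  # total bits to hand out
    bits = np.zeros(num_blocks, dtype=int)

    # Max-heap via negated gains: each entry is (-gain of next bit, block).
    heap = [(-0.75 * float(e), i) for i, e in enumerate(energy)]
    heapq.heapify(heap)

    for _ in range(budget):
        if not heap:  # every block has reached max_bits
            break
        neg_gain, i = heapq.heappop(heap)
        bits[i] += 1
        if bits[i] < max_bits:
            # Under the 4**(-b) model, each extra bit yields a quarter of
            # the previous bit's gain.
            heapq.heappush(heap, (neg_gain / 4.0, i))
    return bits


# Four blocks with widely spread energies and a 2-bit average budget.
example_bits = greedy_bit_allocation([100.0, 10.0, 1.0, 0.1], avg_bits=2)
print(example_bits.tolist())  # [4, 3, 1, 0]
```

In this toy case a uniform allocator would give every block 2 bits, while the greedy allocator spends the same 8-bit total mostly on the two high-energy blocks and none on the weakest. This mirrors the intuition stated in the abstract that high-energy RoPE blocks should receive more bits.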