RoPE-Aware Bit Allocation for KV-Cache Quantization

Abstract

Existing low-bit KV-cache quantizers often treat each cached key as a flat vector. Under RoPE, however, a key's contribution to a future attention logit decomposes into a position-dependent sum over two-dimensional frequency blocks. This makes key-cache quantization a block-wise bit-allocation problem: high-energy RoPE blocks are more sensitive to quantization error and should receive more bits. We introduce Block-GTQ, a RoPE-aware bit allocator for key-cache quantization built on TurboQuant-MSE (TQ-MSE). For each layer and KV head, Block-GTQ computes a label-free energy score for each RoPE block and greedily allocates integer bit widths by marginal gain. Under matched K/V bit budgets, Block-GTQ better preserves RoPE query-key logits on a ten-model diagnostic panel: it cuts per-layer MAE by 32-80% under K-only quantization at 2 and 3 bits/dim and wins all 367 of 367 layer comparisons against uniform TQ-MSE. These fidelity gains translate to stronger downstream long-context retrieval, understanding, and reasoning. At K2V2 on Llama-3.1-8B-Instruct, Block-GTQ raises the six-task NIAH average from 70.6 to 97.4 and the LongBench-EN average from 36.87 to 53.31. On AIME 2024/2025 with DeepSeek-R1-Distill-Qwen-7B, without an fp16 recent-key buffer, Block-GTQ at K3V2 scores 51.7/37.5, close to fp16's 54.2/37.9, whereas uniform TQ-MSE collapses to 0.0/0.0. We further implement a packed-cache serving path. On a single H800 GPU with Qwen2.5-3B-Instruct, packed K3V3 achieves 3.24x KV-cache compression with fp16-comparable quality, runs 1.34x faster than fp16 FlashAttention2 at 128K context, reduces peak memory from 56.31 GB to 19.85 GB, and remains feasible at 256K and 512K context, where fp16 runs out of memory. Code is available at https://github.com/JIA-Lab-research/blockgtq.
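
The greedy allocation step described above can be illustrated with a minimal sketch. This is not the released Block-GTQ implementation: the function names `rope_block_energy` and `greedy_bit_allocation` are hypothetical, the energy score is assumed to be the mean squared norm of each two-dimensional block over a set of calibration keys, blocks are assumed to pair dimensions (2i, 2i+1), and the distortion of a block with energy e at b bits is assumed to scale as e * 4^(-b), the standard high-resolution approximation. The abstract specifies none of these details.

```python
import heapq


def rope_block_energy(keys):
    """Mean squared norm of each 2-D block over calibration keys.

    keys: list of key vectors (one per token), each of even length.
    Assumption: block i pairs dimensions (2i, 2i+1); some RoPE
    implementations pair (i, i + head_dim/2) instead.
    """
    n_blocks = len(keys[0]) // 2
    energy = [0.0] * n_blocks
    for key in keys:
        for i in range(n_blocks):
            energy[i] += key[2 * i] ** 2 + key[2 * i + 1] ** 2
    return [e / len(keys) for e in energy]


def greedy_bit_allocation(energy, avg_bits, max_bits=8):
    """Assign integer bit widths per block under an average bits/dim budget.

    Assumed distortion model: D_i(b) = energy[i] * 4**(-b), so the marginal
    gain of one extra bit is D_i(b) - D_i(b + 1).
    """
    budget = int(round(avg_bits * len(energy)))

    def gain(i, b):
        return energy[i] * (4.0 ** -b - 4.0 ** -(b + 1))

    bits = [0] * len(energy)
    # Max-heap via negated gains; ties are broken by block index.
    heap = [(-gain(i, 0), i) for i in range(len(energy))]
    heapq.heapify(heap)
    for _ in range(budget):
        if not heap:
            break
        _, i = heapq.heappop(heap)
        bits[i] += 1
        if bits[i] < max_bits:
            heapq.heappush(heap, (-gain(i, bits[i]), i))
    return bits


# Toy example: four blocks with widely different energies, 2 bits/dim budget.
energy = [100.0, 10.0, 1.0, 0.1]
bits = greedy_bit_allocation(energy, avg_bits=2)
print(bits)  # [4, 3, 1, 0]
```

In this toy case the average stays at 2 bits per block, but the highest-energy block receives 4 bits and the lowest receives none. A uniform quantizer would instead spend 2 bits on every block. Because the assumed marginal gain shrinks with each added bit, the greedy choice is optimal for this particular distortion model; whether that holds for the actual TQ-MSE distortion is not stated in the abstract.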