KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks

Abstract

Test-time scaling is a powerful approach for improving reasoning in large language models, but long-horizon decoding becomes memory-bottlenecked as the KV-cache grows. KV-cache quantization can relieve this bottleneck, yet current methods are evaluated under prefill-like settings, whereas errors behave differently under autoregressive decoding. We show that in the latter regime, quantization errors accumulate across timesteps, driven primarily by incorrect token scales. We introduce KVarN, a calibration-free KV-cache quantizer that applies a Hadamard rotation followed by a dual-scaling variance normalization across both axes of the K and V matrices. We find that this combination fixes outlying token-scale errors and substantially reduces error accumulation compared with existing baselines. KVarN establishes a new state of the art for KV-cache quantization at 2-bit precision on generative benchmarks, including MATH500, AIME24 and HumanEval. A vLLM implementation of the KVarN method is available at https://github.com/huawei-csl/KVarN.
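To make the pipeline described above concrete, the following is a minimal sketch of a rotate, normalize, quantize round trip on a small tokens-by-channels matrix. It is not the authors' implementation (the linked repository is the reference). Several details are assumptions made only for illustration: the use of RMS as the variance statistic, the alternating column/row rescaling loop and its iteration count, the per-token min-max uniform quantizer, and all function names (`hadamard_rotate`, `dual_scale`, `quantize_row`, `kv_roundtrip`), which are hypothetical.

```python
import math
import random


def hadamard_rotate(vec):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two).

    With the 1/sqrt(n) scaling the transform is its own inverse.
    """
    out = list(vec)
    n = len(out)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


def rms(values):
    """Root mean square, falling back to 1.0 for an all-zero input."""
    return math.sqrt(sum(v * v for v in values) / len(values)) or 1.0


def dual_scale(matrix, iters=3):
    """Assumed form of dual scaling: alternately divide columns and rows by their RMS.

    Returns the normalized matrix plus accumulated row and column scales, so that
    matrix[i][j] == normed[i][j] * row_s[i] * col_s[j] (up to rounding).
    """
    rows, cols = len(matrix), len(matrix[0])
    m = [list(r) for r in matrix]
    row_s, col_s = [1.0] * rows, [1.0] * cols
    for _ in range(iters):
        for j in range(cols):
            s = rms([m[i][j] for i in range(rows)])
            col_s[j] *= s
            for i in range(rows):
                m[i][j] /= s
        for i in range(rows):
            s = rms(m[i])
            row_s[i] *= s
            m[i] = [v / s for v in m[i]]
    return m, row_s, col_s


def quantize_row(row, bits=2):
    """Uniform min-max quantization of one row to 2**bits levels."""
    levels = (1 << bits) - 1
    lo, hi = min(row), max(row)
    step = (hi - lo) / levels or 1.0
    codes = [round((v - lo) / step) for v in row]
    return codes, lo, step


def kv_roundtrip(matrix, bits=2):
    """Rotate, normalize, quantize, then undo each step to get the dequantized matrix."""
    rotated = [hadamard_rotate(r) for r in matrix]
    normed, row_s, col_s = dual_scale(rotated)
    restored = []
    for i, row in enumerate(normed):
        codes, lo, step = quantize_row(row, bits)
        deq = [(c * step + lo) * row_s[i] * col_s[j] for j, c in enumerate(codes)]
        restored.append(hadamard_rotate(deq))  # self-inverse rotation
    return restored
```

Calling `kv_roundtrip(x, bits=2)` on a matrix `x` whose row length is a power of two returns the dequantized approximation; comparing it with `x` for a matrix containing one token row with a much larger scale gives a rough feel for how the stored scales absorb token-level outliers before the 2-bit codes are computed.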