KVarN: Variance-Normalized KV-Cache Quantization Limits Error Accumulation in Reasoning Tasks

Abstract

Test-time scaling is a powerful approach to obtaining better reasoning from large language models, but it becomes memory-bottlenecked during long-horizon decoding as the KV-cache grows. KV-cache quantization can relieve this bottleneck, but current methods are evaluated under prefill-like settings, and errors behave differently under autoregressive decoding. We show that in the latter regime, quantization errors accumulate across timesteps, driven primarily by incorrect token scales. We introduce KVarN, a calibration-free KV-cache quantizer that applies a Hadamard rotation followed by a dual-scaling variance normalization across both axes of the K and V matrices. We find that this combination fixes outlying token-scale errors and substantially reduces error accumulation relative to existing baselines. KVarN establishes a new state of the art for KV-cache quantization on generative benchmarks, including MATH500, AIME24 and HumanEval, at 2-bit precision. A vLLM implementation of the KVarN method is available at https://github.com/huawei-csl/KVarN
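
The abstract only outlines the pipeline (Hadamard rotation, then variance normalization along both axes, then 2-bit quantization), so the following NumPy sketch is an illustration under stated assumptions, not the authors' implementation or the code in the linked repository. It assumes a Sylvester-type orthonormal Hadamard matrix, an alternating row/column RMS normalization as a stand-in for the dual-scaling step, and a fixed uniform 4-level grid with step 1.0. All function names are hypothetical.

```python
import numpy as np


def hadamard(n: int) -> np.ndarray:
    """Orthonormal Hadamard matrix via the Sylvester construction (n must be a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)


def dual_scale(X: np.ndarray, iters: int = 3, eps: float = 1e-8):
    """Alternately normalize the RMS of rows (tokens) and columns (channels).

    Returns Xn, r, c such that X is approximately r[:, None] * Xn * c[None, :].
    Assumption: this alternating scheme stands in for the paper's dual scaling.
    """
    Xn = X.astype(np.float64)  # astype returns a copy, so X is not modified
    r = np.ones(X.shape[0])
    c = np.ones(X.shape[1])
    for _ in range(iters):
        row_rms = np.sqrt(np.mean(Xn ** 2, axis=1)) + eps
        Xn = Xn / row_rms[:, None]
        r *= row_rms
        col_rms = np.sqrt(np.mean(Xn ** 2, axis=0)) + eps
        Xn = Xn / col_rms[None, :]
        c *= col_rms
    return Xn, r, c


def quantize_sketch(X: np.ndarray, step: float = 1.0):
    """Rotate, normalize along both axes, then map each entry to one of 4 codes (2 bits)."""
    H = hadamard(X.shape[1])
    Xn, r, c = dual_scale(X @ H)
    # Bins: (-inf, -step), [-step, 0), [0, step), [step, inf) -> codes 0..3
    codes = np.clip(np.floor(Xn / step) + 2, 0, 3).astype(np.uint8)
    return codes, r, c


def dequantize_sketch(codes: np.ndarray, r: np.ndarray, c: np.ndarray, step: float = 1.0):
    """Invert the sketch: decode to bin centers, undo both scalings, rotate back."""
    H = hadamard(codes.shape[1])
    Xn_hat = (codes.astype(np.float64) - 1.5) * step  # centers at -1.5, -0.5, 0.5, 1.5 (times step)
    X_rot_hat = r[:, None] * Xn_hat * c[None, :]
    return X_rot_hat @ H.T  # H is orthonormal, so H.T is its inverse


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K = rng.standard_normal((64, 128))  # toy "key" matrix: 64 tokens x 128 channels
    K[5] *= 50.0                        # one token with an outlying scale
    codes, r, c = quantize_sketch(K)
    K_hat = dequantize_sketch(codes, r, c)
    print("relative error:", np.linalg.norm(K - K_hat) / np.linalg.norm(K))
```

In this sketch the per-token scales `r` and per-channel scales `c` are kept in full precision and only the normalized entries are stored as 2-bit codes. How KVarN stores its scales, and how it handles streaming tokens during decoding, is not specified in the abstract.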