XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization
August 14, 2025
Authors: Aditya Tomar, Coleman Hooper, Minjae Lee, Haocheng Xi, Rishabh Tiwari, Wonjun Kang, Luca Manolache, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
cs.AI
Abstract
Although LLM inference has emerged as a critical workload for many downstream
applications, efficiently inferring LLMs is challenging due to the substantial
memory footprint and bandwidth requirements. In parallel, compute capabilities
have steadily outpaced both memory capacity and bandwidth over the last few
decades, a trend that remains evident in modern GPU hardware and exacerbates
the challenge of LLM inference. As such, new algorithms are emerging that trade
increased computation for reduced memory operations. To that end, we present
XQuant, which takes advantage of this trend, enabling an order-of-magnitude
reduction in memory consumption through low-bit quantization with substantial
accuracy benefits relative to state-of-the-art KV cache quantization methods.
We accomplish this by quantizing and caching the layer input activations X,
instead of using standard KV caching, and then rematerializing the Keys and
Values on-the-fly during inference. This results in an immediate 2× memory
savings compared to KV caching. By applying XQuant, we achieve up to
~7.7× memory savings with <0.1 perplexity degradation compared to
the FP16 baseline. Furthermore, our approach leverages the fact that X values
are similar across layers. Building on this observation, we introduce
XQuant-CL, which exploits the cross-layer similarity in the X embeddings for
extreme compression. Across different models, XQuant-CL attains up to
10× memory savings relative to the FP16 baseline with only 0.01
perplexity degradation, and 12.5× memory savings with only 0.1
perplexity degradation. XQuant exploits the rapidly increasing compute
capabilities of hardware platforms to eliminate the memory bottleneck, while
surpassing state-of-the-art KV cache quantization methods and achieving
near-FP16 accuracy across a wide range of models.
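To make the core idea concrete, here is a minimal sketch (in PyTorch) of caching a quantized X and rematerializing the Keys and Values from it at inference time. The function names (quantize_x, rematerialize_kv), the weight tensors w_k and w_v, and the simple per-token uniform quantizer are illustrative assumptions, not the paper's implementation.

```python
import torch

def quantize_x(x: torch.Tensor, n_bits: int = 4):
    # Illustrative per-token uniform (asymmetric) quantization of the
    # layer input activations X; the paper's exact quantizer may differ.
    qmax = 2 ** n_bits - 1
    x_min = x.min(dim=-1, keepdim=True).values
    x_max = x.max(dim=-1, keepdim=True).values
    scale = (x_max - x_min).clamp(min=1e-8) / qmax
    x_q = torch.round((x - x_min) / scale).clamp(0, qmax).to(torch.uint8)
    return x_q, scale, x_min

def dequantize_x(x_q: torch.Tensor, scale: torch.Tensor, x_min: torch.Tensor):
    return x_q.to(scale.dtype) * scale + x_min

def rematerialize_kv(x_q, scale, x_min, w_k, w_v):
    # Rebuild Keys and Values on the fly from the cached quantized X.
    # Caching one tensor (X) instead of two (K and V) is what gives the
    # immediate 2x memory saving over a standard KV cache; the cost is
    # two extra matmuls per decoding step.
    x_hat = dequantize_x(x_q, scale, x_min)   # [seq_len, d_model]
    k = x_hat @ w_k                           # [seq_len, d_k]
    v = x_hat @ w_v                           # [seq_len, d_v]
    return k, v
```

At decode time, only the quantized X (plus its scales and zero-points) stays resident; K and V are recomputed each step, trading the abundant compute of modern GPUs for reduced memory footprint and traffic.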
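The cross-layer idea behind XQuant-CL can be illustrated in the same spirit. The sketch below (reusing quantize_x and dequantize_x from above) stores a full quantized X only for the first layer and low-bit residuals for subsequent layers; this is one plausible way to exploit the cross-layer similarity of X, and the paper's actual scheme may differ.

```python
def compress_x_cross_layer(x_per_layer, n_bits_base=4, n_bits_delta=2):
    # Hypothetical cross-layer scheme: quantize X for the first layer, then
    # quantize only the residual X_l - X_hat_{l-1} for later layers. Because
    # adjacent layers produce similar X embeddings, the residuals are small
    # and tolerate very low bit-widths, enabling the extra compression.
    cache, prev_hat = [], None
    for x in x_per_layer:
        if prev_hat is None:
            q, s, z = quantize_x(x, n_bits_base)
            prev_hat = dequantize_x(q, s, z)
        else:
            q, s, z = quantize_x(x - prev_hat, n_bits_delta)
            prev_hat = prev_hat + dequantize_x(q, s, z)
        cache.append((q, s, z))
    return cache
```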