

XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization

August 14, 2025
作者: Aditya Tomar, Coleman Hooper, Minjae Lee, Haocheng Xi, Rishabh Tiwari, Wonjun Kang, Luca Manolache, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
cs.AI

Abstract

Although LLM inference has emerged as a critical workload for many downstream applications, efficiently inferring LLMs is challenging due to the substantial memory footprint and bandwidth requirements. In parallel, compute capabilities have steadily outpaced both memory capacity and bandwidth over the last few decades, a trend that remains evident in modern GPU hardware and exacerbates the challenge of LLM inference. As such, new algorithms are emerging that trade increased computation for reduced memory operations. To that end, we present XQuant, which takes advantage of this trend, enabling an order-of-magnitude reduction in memory consumption through low-bit quantization with substantial accuracy benefits relative to state-of-the-art KV cache quantization methods. We accomplish this by quantizing and caching the layer input activations X, instead of using standard KV caching, and then rematerializing the Keys and Values on-the-fly during inference. This results in an immediate 2× memory savings compared to KV caching. By applying XQuant, we achieve up to ~7.7× memory savings with <0.1 perplexity degradation compared to the FP16 baseline. Furthermore, our approach leverages the fact that X values are similar across layers. Building on this observation, we introduce XQuant-CL, which exploits the cross-layer similarity in the X embeddings for extreme compression. Across different models, XQuant-CL attains up to 10× memory savings relative to the FP16 baseline with only 0.01 perplexity degradation, and 12.5× memory savings with only 0.1 perplexity degradation. XQuant exploits the rapidly increasing compute capabilities of hardware platforms to eliminate the memory bottleneck, while surpassing state-of-the-art KV cache quantization methods and achieving near-FP16 accuracy across a wide range of models.
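To make the core idea concrete, below is a minimal PyTorch sketch of the X-caching and rematerialization step described in the abstract: instead of storing K and V per token, the layer input X is quantized and cached, then K = X·W_K and V = X·W_V are recomputed at decode time. The quantizer here is a simple per-token uniform int quantizer used purely as an illustrative stand-in; the actual quantization scheme, tensor layouts, and class/function names (quantize_per_token, XQuantCache, etc.) are assumptions, not the paper's implementation.

```python
import torch

def quantize_per_token(x, n_bits=4):
    # Illustrative uniform per-token quantization (placeholder for the paper's scheme).
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale

def dequantize(q, scale):
    return q.to(torch.float16) * scale

class XQuantCache:
    # Caches quantized layer inputs X (one tensor per token) instead of K and V (two tensors),
    # which is the source of the immediate 2x memory savings noted in the abstract.
    def __init__(self):
        self.q_chunks, self.scales = [], []

    def append(self, x_new):
        # x_new: [new_tokens, d_model] -- the layer input activations for the new token(s).
        q, s = quantize_per_token(x_new)
        self.q_chunks.append(q)
        self.scales.append(s)

    def rematerialize(self, w_k, w_v):
        # Rebuild K and V on the fly from the cached X: K = X @ W_K, V = X @ W_V.
        x = torch.cat([dequantize(q, s) for q, s in zip(self.q_chunks, self.scales)], dim=0)
        return x @ w_k, x @ w_v

# Hypothetical usage during autoregressive decoding:
d_model, d_kv = 4096, 1024
w_k = torch.randn(d_model, d_kv, dtype=torch.float16)
w_v = torch.randn(d_model, d_kv, dtype=torch.float16)
cache = XQuantCache()
x_step = torch.randn(1, d_model, dtype=torch.float16)  # new token's layer input
cache.append(x_step)
k, v = cache.rematerialize(w_k, w_v)  # K, V recomputed rather than read from a KV cache
```

XQuant-CL, per the abstract, goes further by exploiting the similarity of X across consecutive layers (e.g., storing only small cross-layer differences); that refinement is omitted from this sketch for brevity.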