Inference-Time Hyper-Scaling with KV Cache Compression
June 5, 2025
Authors: Adrian Łańcucki, Konrad Staniszewski, Piotr Nawrot, Edoardo M. Ponti
cs.AI
Abstract
Inference-time scaling trades efficiency for increased reasoning accuracy by
generating longer or more parallel sequences. However, in Transformer LLMs,
generation cost is bottlenecked by the size of the key-value (KV) cache, rather
than the number of generated tokens. Hence, we explore inference-time
hyper-scaling: by compressing the KV cache, we can generate more tokens within
the same compute budget and further improve the accuracy of scaled inference.
The success of this approach, however, hinges on the ability of compression
methods to preserve accuracy even at high compression ratios. To make
hyper-scaling practical, we introduce Dynamic Memory Sparsification (DMS), a
novel method for sparsifying KV caches that only requires 1K training steps to
achieve 8x compression, while maintaining better accuracy than
training-free sparse attention. Instead of prematurely discarding cached
tokens, DMS delays token eviction, implicitly merging representations and
preserving critical information. We demonstrate the effectiveness of
inference-time hyper-scaling with DMS on multiple families of LLMs, showing
that it boosts accuracy for comparable inference runtime and memory load. For
instance, we enhance Qwen-R1 32B by an average of 9.1 points on AIME 24, 7.6 on
GPQA, and 9.6 on LiveCodeBench across compute budgets.
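
To make the delayed-eviction idea concrete, below is a minimal, illustrative sketch: instead of dropping a token from the KV cache the moment a budget is exceeded, the token is only marked for eviction and stays readable for a short grace window, so later tokens can still attend to it before it is removed. This is not the paper's DMS implementation (which learns eviction decisions during a short retrofitting phase of ~1K training steps); all names here (DelayedEvictionKVCache, budget, delay, keep_score) are hypothetical, and the importance score is hand-crafted purely for demonstration.

```python
class DelayedEvictionKVCache:
    """Toy KV cache with delayed eviction (illustrative only, not the paper's DMS).

    When the number of live entries exceeds the budget, the lowest-scoring
    entry is *marked* for eviction but stays readable for `delay` more steps,
    so later tokens can still attend to it before it is physically removed.
    """

    def __init__(self, budget: int, delay: int):
        self.budget = budget  # hypothetical cap on unmarked ("live") entries
        self.delay = delay    # grace window, in decoding steps, before removal
        self.entries = []     # each entry: {"k", "v", "score", "evict_at"}

    def append(self, step: int, k, v, keep_score: float):
        """Add the KV pair produced at `step`; `keep_score` stands in for
        whatever importance signal decides which token to evict."""
        self.entries.append({"k": k, "v": v, "score": keep_score, "evict_at": None})

        # If over budget, mark the least important unmarked entry for future eviction.
        live = [e for e in self.entries if e["evict_at"] is None]
        if len(live) > self.budget:
            victim = min(live, key=lambda e: e["score"])
            victim["evict_at"] = step + self.delay

        # Physically remove entries whose grace window has expired.
        self.entries = [e for e in self.entries
                        if e["evict_at"] is None or e["evict_at"] > step]

    def readable(self):
        """All retained entries, including marked-but-not-yet-removed ones,
        remain visible to attention at the current step."""
        return [(e["k"], e["v"]) for e in self.entries]


if __name__ == "__main__":
    cache = DelayedEvictionKVCache(budget=4, delay=2)
    for t in range(12):
        cache.append(step=t, k=f"k{t}", v=f"v{t}", keep_score=float(t % 3))
    # The cache hovers near the budget plus a small grace window
    # instead of growing linearly with the number of generated tokens.
    print(len(cache.readable()))
```

A bounded cache like this is the lever that hyper-scaling exploits: under a fixed memory and compute budget, compressing the KV cache frees room to generate more tokens. The actual DMS method differs in that the eviction decisions are learned end-to-end rather than driven by a hand-crafted score.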