
Inference-Time Hyper-Scaling with KV Cache Compression

June 5, 2025
Authors: Adrian Łańcucki, Konrad Staniszewski, Piotr Nawrot, Edoardo M. Ponti
cs.AI

Abstract

Inference-time scaling trades efficiency for increased reasoning accuracy by generating longer or more parallel sequences. However, in Transformer LLMs, generation cost is bottlenecked by the size of the key-value (KV) cache, rather than the number of generated tokens. Hence, we explore inference-time hyper-scaling: by compressing the KV cache, we can generate more tokens within the same compute budget and further improve the accuracy of scaled inference. The success of this approach, however, hinges on the ability of compression methods to preserve accuracy even at high compression ratios. To make hyper-scaling practical, we introduce Dynamic Memory Sparsification (DMS), a novel method for sparsifying KV caches that only requires 1K training steps to achieve 8× compression, while maintaining better accuracy than training-free sparse attention. Instead of prematurely discarding cached tokens, DMS delays token eviction, implicitly merging representations and preserving critical information. We demonstrate the effectiveness of inference-time hyper-scaling with DMS on multiple families of LLMs, showing that it boosts accuracy for comparable inference runtime and memory load. For instance, we enhance Qwen-R1 32B by an average of 9.1 points on AIME 24, 7.6 on GPQA, and 9.6 on LiveCodeBench across compute budgets.
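
The abstract describes DMS as delaying token eviction and implicitly merging representations rather than dropping cached tokens outright. Below is a minimal, hypothetical PyTorch sketch of that general idea: a fixed-budget KV cache that folds an evicted token's key/value vectors into its neighbor before removal, so some of its information survives. The class name `DelayedEvictionKVCache`, the averaging merge rule, and the `merge_weight` parameter are illustrative assumptions for this sketch, not the paper's DMS implementation, which learns its eviction decisions during a short (~1K-step) retrofit.

```python
# Toy sketch of eviction-with-merging for a KV cache (illustrative only,
# not the DMS method from the paper): when the cache exceeds its budget,
# the oldest token is not simply dropped; its key/value are first blended
# into the next-oldest entry before removal.
import torch


class DelayedEvictionKVCache:
    def __init__(self, budget: int, merge_weight: float = 0.5):
        self.budget = budget              # max number of cached tokens
        self.merge_weight = merge_weight  # fraction of the evicted token folded in
        self.keys: list[torch.Tensor] = []
        self.values: list[torch.Tensor] = []

    def append(self, k: torch.Tensor, v: torch.Tensor) -> None:
        self.keys.append(k)
        self.values.append(v)
        if len(self.keys) > self.budget:
            self._evict_oldest()

    def _evict_oldest(self) -> None:
        # Merge the evicted token into its successor instead of discarding it.
        k_old, v_old = self.keys.pop(0), self.values.pop(0)
        w = self.merge_weight
        self.keys[0] = (1 - w) * self.keys[0] + w * k_old
        self.values[0] = (1 - w) * self.values[0] + w * v_old

    def tensors(self) -> tuple[torch.Tensor, torch.Tensor]:
        return torch.stack(self.keys), torch.stack(self.values)


# Usage: an 8x compression budget over a 64-token context (head_dim = 16).
cache = DelayedEvictionKVCache(budget=8)
for _ in range(64):
    cache.append(torch.randn(16), torch.randn(16))
K, V = cache.tensors()
print(K.shape, V.shape)  # torch.Size([8, 16]) torch.Size([8, 16])
```

The point of the sketch is the budget arithmetic behind hyper-scaling: holding memory per sequence roughly constant, an 8× smaller cache leaves headroom to decode longer or more parallel reasoning traces within the same inference budget.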