
IceCache: Memory-efficient KV-cache Management for Long-Sequence LLMs

April 12, 2026
作者: Yuzhen Mao, Qitong Wang, Martin Ester, Ke Li
cs.AI

Abstract

Key-Value (KV) cache plays a crucial role in accelerating inference in large language models (LLMs) by storing intermediate attention states and avoiding redundant computation during autoregressive generation. However, its memory footprint scales linearly with sequence length, often leading to severe memory bottlenecks on resource-constrained hardware. Prior work has explored offloading KV cache to the CPU while retaining only a subset on the GPU, but these approaches often rely on imprecise token selection and suffer performance degradation in long-generation tasks such as chain-of-thought reasoning. In this paper, we propose a novel KV cache management strategy, IceCache, which integrates semantic token clustering with PagedAttention. By organizing semantically related tokens into contiguous memory regions managed by a hierarchical, dynamically updatable data structure, our method enables more efficient token selection and better utilization of memory bandwidth during CPU-GPU transfers. Experimental results on LongBench show that, with a 256-token budget, IceCache maintains 99% of the original accuracy achieved by the full KV cache model. Moreover, compared to other offloading-based methods, IceCache attains competitive or even superior latency and accuracy while using only 25% of the KV cache token budget, demonstrating its effectiveness in long-sequence scenarios. The code is available on our project website at https://yuzhenmao.github.io/IceCache/.
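To make the idea concrete, here is a minimal, illustrative sketch of the two ingredients the abstract describes: clustering cached key vectors so that semantically similar tokens can be stored as contiguous groups (pages), then selecting whole groups under a fixed token budget by query-centroid similarity. This is not the paper's implementation; all function names, the use of plain k-means, and the dot-product scoring rule are assumptions made for this example.

```python
# Illustrative sketch (not IceCache's actual code): k-means over cached key
# vectors groups semantically related tokens, and whole clusters are chosen
# under a token budget by scoring cluster centroids against the query.
import numpy as np

def cluster_kv_cache(keys: np.ndarray, n_clusters: int, n_iters: int = 10):
    """Simple k-means over key vectors; returns centroids and per-token labels."""
    rng = np.random.default_rng(0)
    centroids = keys[rng.choice(len(keys), n_clusters, replace=False)].copy()
    for _ in range(n_iters):
        # Assign each cached token to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(keys[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster became empty.
        for c in range(n_clusters):
            members = keys[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return centroids, labels

def select_tokens(query: np.ndarray, centroids: np.ndarray,
                  labels: np.ndarray, budget: int) -> np.ndarray:
    """Pick whole clusters, highest query-centroid score first, until the
    token budget would be exceeded; returns the chosen token indices.
    Selecting at cluster granularity is what lets tokens be fetched from
    CPU memory in contiguous blocks rather than one scattered token at a time."""
    order = np.argsort(-(centroids @ query))  # best-scoring clusters first
    chosen: list[int] = []
    for c in order:
        members = np.flatnonzero(labels == c)
        if len(chosen) + len(members) > budget:
            continue  # this whole cluster would overflow the budget; skip it
        chosen.extend(members.tolist())
    return np.sort(np.array(chosen, dtype=int))

# Toy usage: 64 cached tokens with 8-dim keys, a 16-token budget.
keys = np.random.default_rng(1).standard_normal((64, 8)).astype(np.float32)
centroids, labels = cluster_kv_cache(keys, n_clusters=8)
picked = select_tokens(keys.mean(axis=0), centroids, labels, budget=16)
assert len(picked) <= 16
```

The paper additionally keeps these clusters in a hierarchical, dynamically updatable structure and lays them out with PagedAttention-style paging; the sketch above only shows the grouping-and-selection logic at its simplest.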