IceCache: Memory-Efficient KV Cache Management for Long-Sequence Large Language Models

Abstract

The key-value (KV) cache plays a crucial role in accelerating inference in large language models (LLMs): it stores intermediate attention states and so avoids redundant computation during autoregressive generation. However, its memory footprint grows linearly with sequence length, which often causes severe memory bottlenecks on resource-constrained hardware. Prior work has explored offloading the KV cache to the CPU while retaining only a subset on the GPU, but these approaches often rely on imprecise token selection and suffer performance degradation in long-generation tasks such as chain-of-thought reasoning. In this paper, we propose IceCache, a novel KV cache management strategy that integrates semantic token clustering with PagedAttention. By organizing semantically related tokens into contiguous memory regions managed by a hierarchical, dynamically updatable data structure, our method enables more efficient token selection and better utilization of memory bandwidth during CPU-GPU transfers. Experimental results on LongBench show that, with a 256-token budget, IceCache maintains 99% of the accuracy achieved by the full KV cache model. Moreover, compared to other offloading-based methods, IceCache attains competitive or even superior latency and accuracy while using only 25% of the KV cache token budget, demonstrating its effectiveness in long-sequence scenarios. The code is available on our project website at https://yuzhenmao.github.io/IceCache/.
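
To make the idea of page-level token selection concrete, the following is a minimal sketch and not the IceCache implementation. It assumes a flat, one-shot clustering (the abstract describes a hierarchical, dynamically updatable structure, which is not reproduced here), and every name in it (`Page`, `build_pages`, `select_tokens`, the `seeds` argument) is hypothetical. Keys are grouped by their nearest seed vector, each group is cut into fixed-size pages summarized by a mean key, and a query then picks whole pages by query-centroid score until the token budget is reached.

```python
from dataclasses import dataclass
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def mean(vectors: List[Vector]) -> List[float]:
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


@dataclass
class Page:
    token_ids: List[int]   # positions of the tokens stored in this page
    centroid: List[float]  # mean key vector, used as the page summary


def build_pages(keys: List[Vector], seeds: List[Vector], page_size: int) -> List[Page]:
    """Group keys by nearest seed (largest dot product), then cut each group into pages.

    Assumption: `seeds` are given; a real system would learn and update clusters.
    """
    groups: List[List[int]] = [[] for _ in seeds]
    for token_id, key in enumerate(keys):
        best = max(range(len(seeds)), key=lambda s: dot(key, seeds[s]))
        groups[best].append(token_id)
    pages = []
    for group in groups:
        for start in range(0, len(group), page_size):
            ids = group[start:start + page_size]
            pages.append(Page(ids, mean([keys[i] for i in ids])))
    return pages


def select_tokens(query: Vector, pages: List[Page], budget: int) -> List[int]:
    """Pick whole pages in order of query-centroid score until the budget would be exceeded."""
    ranked = sorted(pages, key=lambda p: dot(query, p.centroid), reverse=True)
    chosen: List[int] = []
    for page in ranked:
        if len(chosen) + len(page.token_ids) > budget:
            break
        chosen.extend(page.token_ids)
    return sorted(chosen)


if __name__ == "__main__":
    # Toy 2-D keys: even positions point along x, odd positions along y.
    keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9], [1.0, 0.2], [0.2, 1.0]]
    pages = build_pages(keys, seeds=[[1.0, 0.0], [0.0, 1.0]], page_size=2)
    print(select_tokens([1.0, 0.0], pages, budget=3))  # [0, 2, 4]
```

Because selection works on whole pages, each chosen page corresponds to one contiguous block that could be copied from CPU to GPU in a single transfer, which is the bandwidth argument made above. The toy example uses a 3-token budget only for readability.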