IceCache: Memory-efficient KV-cache Management for Long-Sequence LLMs

Abstract

The key-value (KV) cache plays a crucial role in accelerating inference in large language models (LLMs): it stores intermediate attention states and avoids redundant computation during autoregressive generation. However, its memory footprint scales linearly with sequence length, often causing severe memory bottlenecks on resource-constrained hardware. Prior work has explored offloading the KV cache to the CPU while retaining only a subset on the GPU, but these approaches often rely on imprecise token selection and suffer performance degradation on long-generation tasks such as chain-of-thought reasoning. In this paper, we propose IceCache, a novel KV cache management strategy that integrates semantic token clustering with PagedAttention. By organizing semantically related tokens into contiguous memory regions managed by a hierarchical, dynamically updatable data structure, our method enables more efficient token selection and better utilization of memory bandwidth during CPU-GPU transfers. Experimental results on LongBench show that, with a 256-token budget, IceCache maintains 99% of the accuracy achieved by the full KV cache model. Moreover, compared to other offloading-based methods, IceCache attains competitive or even superior latency and accuracy while using only 25% of the KV cache token budget, demonstrating its effectiveness in long-sequence scenarios. The code is available on the project website at https://yuzhenmao.github.io/IceCache/.
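
The two ideas in the abstract, linear cache growth and cluster-guided token selection under a budget, can be illustrated with a small sketch. This is not the IceCache implementation: the abstract does not specify the hierarchical data structure, the clustering algorithm, or the similarity measure. The function names (`kv_cache_bytes`, `build_pages`, `select_tokens`), the use of fixed centroids, and dot-product scoring are all assumptions made for illustration, and the variable-length "pages" here are a simplification of the fixed-size blocks used by PagedAttention.

```python
from typing import List, Sequence

Vector = Sequence[float]


def kv_cache_bytes(seq_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes needed for keys plus values of one sequence (linear in seq_len)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def build_pages(keys: List[Vector], centroids: List[Vector]) -> List[List[int]]:
    """Group token indices by the centroid most similar to each key.

    Assumption: centroids are given. A real system would learn and
    update them as new tokens arrive.
    """
    pages: List[List[int]] = [[] for _ in centroids]
    for idx, key in enumerate(keys):
        best = max(range(len(centroids)), key=lambda c: dot(key, centroids[c]))
        pages[best].append(idx)
    return pages


def select_tokens(query: Vector, centroids: List[Vector],
                  pages: List[List[int]], budget: int) -> List[int]:
    """Take pages in order of query-centroid score until the budget is full."""
    order = sorted(range(len(centroids)),
                   key=lambda c: dot(query, centroids[c]), reverse=True)
    selected: List[int] = []
    for c in order:
        room = budget - len(selected)
        if room <= 0:
            break
        selected.extend(pages[c][:room])
    return sorted(selected)


if __name__ == "__main__":
    # Hypothetical toy data: five 2-D keys and three cluster centroids.
    keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, 0.0]]
    centroids = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    pages = build_pages(keys, centroids)
    print(select_tokens([0.2, 1.0], centroids, pages, budget=3))
    print(kv_cache_bytes(seq_len=1000, num_layers=32, num_kv_heads=8, head_dim=128))
```

Scoring one centroid per page instead of every cached key is what makes selection cheap in this sketch, and fetching whole pages keeps each CPU-GPU copy contiguous. Whether IceCache scores or transfers pages in this way is not stated in the abstract.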
