IceCache: Memory-efficient KV-cache Management for Long-Sequence LLMs

Abstract

The key-value (KV) cache plays a crucial role in accelerating inference in large language models (LLMs) by storing intermediate attention states and avoiding redundant computation during autoregressive generation. However, its memory footprint scales linearly with sequence length, often leading to severe memory bottlenecks on resource-constrained hardware. Prior work has explored offloading the KV cache to the CPU while retaining only a subset on the GPU, but these approaches often rely on imprecise token selection and suffer performance degradation in long-generation tasks such as chain-of-thought reasoning. In this paper, we propose IceCache, a novel KV cache management strategy that integrates semantic token clustering with PagedAttention. By organizing semantically related tokens into contiguous memory regions managed by a hierarchical, dynamically updatable data structure, our method enables more efficient token selection and better utilization of memory bandwidth during CPU-GPU transfers. Experimental results on LongBench show that, with a 256-token budget, IceCache maintains 99% of the original accuracy achieved by the full KV cache model. Moreover, compared to other offloading-based methods, IceCache attains competitive or even superior latency and accuracy while using only 25% of the KV cache token budget, demonstrating its effectiveness in long-sequence scenarios. The code is available on our project website at https://yuzhenmao.github.io/IceCache/.
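
As a rough illustration of the linear memory growth described above, the following minimal sketch estimates the size of a full KV cache from the model shape. The function name kv_cache_bytes and all default values (32 layers, 8 KV heads, head dimension 128, 2-byte elements) are assumptions chosen for illustration; they are not taken from the paper or from any particular model.

```python
def kv_cache_bytes(seq_len, num_layers=32, num_kv_heads=8, head_dim=128,
                   bytes_per_elem=2, batch_size=1):
    """Estimate the bytes needed to store keys and values for seq_len tokens.

    All defaults are hypothetical, illustrative model dimensions.
    """
    # Factor of 2: every layer stores one key tensor and one value tensor.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    # The total grows linearly with both sequence length and batch size.
    return batch_size * seq_len * per_token


if __name__ == "__main__":
    for n in (256, 4096, 32768):
        print(f"{n:>6} tokens -> {kv_cache_bytes(n) / 2**20:.0f} MiB")
```

Under these assumed dimensions, each token costs 128 KiB, so a 32,768-token context needs about 4 GiB for the full cache. This is only back-of-the-envelope arithmetic meant to show why keeping a small subset of tokens on the GPU is attractive; it does not model IceCache's clustering or paging.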
