EpiCache: Episodic KV Cache Management for Long Conversational Question Answering

Abstract

Recent advances in large language models (LLMs) have extended context lengths, enabling assistants to retain long conversation histories for coherent, personalized responses. This ability, however, hinges on Key-Value (KV) caching, whose memory grows linearly with dialogue length and quickly dominates memory usage under strict resource constraints. An active line of research for reducing this overhead is KV cache compression, which seeks to limit cache size while preserving accuracy. Yet existing methods face two major limitations: (i) evicting entries only after full-context prefill leaves peak memory unbounded, and (ii) query-dependent eviction narrows the cache to a single query, which degrades accuracy in multi-turn conversations. We introduce EpiCache, a training-free KV cache management framework for long conversational question answering (LongConvQA) under fixed memory budgets. EpiCache bounds cache growth through block-wise prefill and preserves topic-relevant context via episodic KV compression, which clusters conversation history into coherent episodes and applies episode-specific KV cache eviction. We further design an adaptive layer-wise budget allocation strategy that measures each layer's sensitivity to eviction and distributes the memory budget across layers accordingly. Across three LongConvQA benchmarks, EpiCache improves accuracy by up to 40% over recent baselines, sustains near-full KV accuracy under 4-6x compression, and reduces latency and memory by up to 2.4x and 3.5x, respectively, thereby enabling efficient multi-turn interaction under strict resource constraints.
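The peak-memory point in limitation (i) can be made concrete with a toy simulation. The sketch below is not EpiCache's implementation: the function names, the block size, the budget, and the random per-token scores are all hypothetical stand-ins (the abstract does not specify how episode-specific eviction scores are computed). It only illustrates why evicting after every block keeps the cache within roughly budget plus one block, whereas evicting after a full prefill lets the cache grow to the full context length first.

```python
import random


def blockwise_prefill(num_tokens, block_size, budget, score_fn):
    """Toy block-wise prefill: append one block, then evict down to `budget`.

    Returns the kept token positions and the peak cache size observed.
    """
    cache = []  # token positions currently held in the toy cache
    peak = 0
    for start in range(0, num_tokens, block_size):
        end = min(start + block_size, num_tokens)
        cache.extend(range(start, end))
        peak = max(peak, len(cache))
        if len(cache) > budget:
            # Keep the highest-scoring entries, then restore positional order.
            keep = sorted(cache, key=score_fn, reverse=True)[:budget]
            cache = sorted(keep)
    return cache, peak


def full_prefill_then_evict(num_tokens, budget, score_fn):
    """Toy post-prefill eviction: the whole context is cached before evicting."""
    cache = list(range(num_tokens))
    peak = len(cache)
    keep = sorted(cache, key=score_fn, reverse=True)[:budget]
    return sorted(keep), peak


def make_placeholder_scores(num_tokens, seed=0):
    """Random importance scores; a placeholder, NOT the paper's scoring."""
    rng = random.Random(seed)
    scores = [rng.random() for _ in range(num_tokens)]
    return scores.__getitem__
```

Calling `blockwise_prefill(1000, 64, 256, make_placeholder_scores(1000))` in this toy setting never holds more than 256 + 64 entries at once, while `full_prefill_then_evict` with the same arguments peaks at all 1000 entries before trimming to 256.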
