

EpiCache: Episodic KV Cache Management for Long Conversational Question Answering

September 22, 2025
Authors: Minsoo Kim, Arnav Kundu, Han-Byul Kim, Richa Dixit, Minsik Cho
cs.AI

Abstract

Recent advances in large language models (LLMs) have extended context lengths, enabling assistants to sustain long histories for coherent, personalized responses. This ability, however, hinges on Key-Value (KV) caching, whose memory grows linearly with dialogue length and quickly dominates under strict resource constraints. An active line of research for reducing this overhead is KV cache compression, which seeks to limit cache size while preserving accuracy. Yet existing methods face two major limitations: (i) evicting entries after full-context prefill causes unbounded peak memory, and (ii) query-dependent eviction narrows the cache to a single query, leading to degraded accuracy in multi-turn conversations. We introduce EpiCache, a training-free KV cache management framework for long conversational question answering (LongConvQA) under fixed memory budgets. EpiCache bounds cache growth through block-wise prefill and preserves topic-relevant context via episodic KV compression, which clusters conversation history into coherent episodes and applies episode-specific KV cache eviction. We further design an adaptive layer-wise budget allocation strategy that measures each layer's sensitivity to eviction and distributes the memory budget across layers accordingly. Across three LongConvQA benchmarks, EpiCache improves accuracy by up to 40% over recent baselines, sustains near-full KV accuracy under 4-6x compression, and reduces latency and memory by up to 2.4x and 3.5x, thereby enabling efficient multi-turn interaction under strict resource constraints.
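The abstract's adaptive layer-wise budget allocation — measuring each layer's sensitivity to eviction and splitting a fixed memory budget proportionally — can be sketched as follows. This is an illustrative sketch only: the function name, the uniform floor, and the proportional rule are assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def allocate_layer_budgets(sensitivities, total_budget, floor_frac=0.2):
    """Split a fixed total KV-cache budget across layers in proportion
    to each layer's measured eviction sensitivity (hypothetical scheme).

    A small uniform floor (floor_frac of the budget) keeps low-sensitivity
    layers from being starved of cache entries entirely.
    """
    s = np.asarray(sensitivities, dtype=float)
    floor = floor_frac * total_budget / len(s)   # uniform per-layer floor
    remaining = total_budget - floor * len(s)    # budget left to distribute
    return floor + remaining * s / s.sum()       # sensitivity-proportional share

# Example: 4 layers, layer 1 most sensitive to eviction;
# the per-layer budgets always sum to the fixed total.
budgets = allocate_layer_budgets([0.1, 0.5, 0.2, 0.2], total_budget=4096)
print(budgets.sum())  # 4096.0
```

The key property is that peak memory stays bounded by `total_budget` regardless of how sensitivity is distributed, which is what lets the method hold a fixed memory envelope under long multi-turn conversations.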