EpiCache: Episodic KV Cache Management for Long Conversational Question Answering
September 22, 2025
Authors: Minsoo Kim, Arnav Kundu, Han-Byul Kim, Richa Dixit, Minsik Cho
cs.AI
Abstract
Recent advances in large language models (LLMs) have extended context
lengths, enabling assistants to sustain long histories for coherent,
personalized responses. This ability, however, hinges on Key-Value (KV)
caching, whose memory grows linearly with dialogue length and quickly dominates
under strict resource constraints. An active line of research for reducing this
overhead is KV cache compression, which seeks to limit cache size while
preserving accuracy. Yet existing methods face two major limitations: (i)
evicting entries after full-context prefill causes unbounded peak memory, and
(ii) query-dependent eviction narrows the cache to a single query, leading to
degraded accuracy in multi-turn conversations. We introduce EpiCache, a
training-free KV cache management framework for long conversational question
answering (LongConvQA) under fixed memory budgets. EpiCache bounds cache growth
through block-wise prefill and preserves topic-relevant context via episodic KV
compression, which clusters conversation history into coherent episodes and
applies episode-specific KV cache eviction. We further design an adaptive
layer-wise budget allocation strategy that measures each layer's sensitivity to
eviction and distributes the memory budget across layers accordingly. Across
three LongConvQA benchmarks, EpiCache improves accuracy by up to 40% over
recent baselines, sustains near-full KV accuracy under 4-6x compression, and
reduces latency and memory by up to 2.4x and 3.5x, thereby enabling efficient
multi-turn interaction under strict resource constraints.
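The adaptive layer-wise budget allocation described above can be illustrated with a minimal sketch. Assuming each layer's sensitivity to eviction has already been measured as a nonnegative score (the paper's exact sensitivity metric and allocation rule are not specified here), one simple scheme distributes the total KV-cache budget across layers in proportion to those scores. The function name `allocate_layer_budgets` and the proportional rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def allocate_layer_budgets(sensitivities, total_budget):
    """Split a total KV-cache entry budget across layers in proportion to
    each layer's measured eviction sensitivity.

    Hypothetical allocation rule for illustration; EpiCache's actual
    strategy may differ in how sensitivity is measured and mapped to budgets.
    """
    s = np.asarray(sensitivities, dtype=float)
    weights = s / s.sum()                         # normalize sensitivities
    budgets = np.floor(weights * total_budget).astype(int)
    # Hand out any rounding remainder to the most sensitive layers first,
    # so the budgets always sum exactly to total_budget.
    remainder = total_budget - budgets.sum()
    for idx in np.argsort(-s)[:remainder]:
        budgets[idx] += 1
    return budgets
```

For example, with sensitivities `[1.0, 3.0]` and a total budget of 8 cache entries, the more sensitive layer receives 6 entries and the less sensitive one 2, rather than an even 4/4 split.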