EpiCache: Episodic KV Cache Management for Long Conversational Question Answering
September 22, 2025
Authors: Minsoo Kim, Arnav Kundu, Han-Byul Kim, Richa Dixit, Minsik Cho
cs.AI
Abstract
Recent advances in large language models (LLMs) have extended context
lengths, enabling assistants to sustain long histories for coherent,
personalized responses. This ability, however, hinges on Key-Value (KV)
caching, whose memory grows linearly with dialogue length and quickly dominates
under strict resource constraints. An active line of research for reducing this
overhead is KV cache compression, which seeks to limit cache size while
preserving accuracy. Yet existing methods face two major limitations: (i)
evicting entries after full-context prefill causes unbounded peak memory, and
(ii) query-dependent eviction narrows the cache to a single query, leading to
degraded accuracy in multi-turn conversations. We introduce EpiCache, a
training-free KV cache management framework for long conversational question
answering (LongConvQA) under fixed memory budgets. EpiCache bounds cache growth
through block-wise prefill and preserves topic-relevant context via episodic KV
compression, which clusters conversation history into coherent episodes and
applies episode-specific KV cache eviction. We further design an adaptive
layer-wise budget allocation strategy that measures each layer's sensitivity to
eviction and distributes the memory budget across layers accordingly. Across
three LongConvQA benchmarks, EpiCache improves accuracy by up to 40% over
recent baselines, sustains near-full KV accuracy under 4-6x compression, and
reduces latency and memory by up to 2.4x and 3.5x, thereby enabling efficient
multi-turn interaction under strict resource constraints.
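The adaptive layer-wise budget allocation described above can be illustrated with a minimal sketch. Assuming each layer's sensitivity to eviction has already been measured as a nonnegative score (the paper's exact sensitivity metric and allocation rule are not specified here), one simple scheme distributes the total KV-cache budget across layers in proportion to those scores. The function name `allocate_layer_budgets` and the proportional rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def allocate_layer_budgets(sensitivities, total_budget):
    """Split a total KV-cache entry budget across layers in proportion to
    each layer's measured eviction sensitivity.

    Hypothetical allocation rule for illustration; EpiCache's actual
    strategy may differ in how sensitivity is measured and mapped to budgets.
    """
    s = np.asarray(sensitivities, dtype=float)
    weights = s / s.sum()                         # normalize sensitivities
    budgets = np.floor(weights * total_budget).astype(int)
    # Hand out any rounding remainder to the most sensitive layers first,
    # so the budgets always sum exactly to total_budget.
    remainder = total_budget - budgets.sum()
    for idx in np.argsort(-s)[:remainder]:
        budgets[idx] += 1
    return budgets
```

For example, with sensitivities `[1.0, 3.0]` and a total budget of 8 cache entries, the more sensitive layer receives 6 entries and the less sensitive one 2, rather than an even 4/4 split.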