LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation

March 11, 2026
Authors: Jinwoo Ahn, Ingyu Seong, Akhil Kedia, Junhan Kim, Hyemi Jang, Kangwook Lee, Yongkweon Jeon
cs.AI

Abstract

Transformer-based large language models (LLMs) rely on key-value (KV) caching to avoid redundant computation during autoregressive inference. While this mechanism greatly improves efficiency, the cache size grows linearly with the input sequence length, quickly becoming a bottleneck for long-context tasks. Existing solutions mitigate this problem by evicting prompt KV entries that are deemed unimportant, guided by estimated importance scores. Notably, a recent line of work proposes to improve eviction quality by "glimpsing into the future": a draft generator produces a surrogate future response approximating the target model's true response, and this surrogate is then used to estimate the importance of cached KV entries more accurately. However, these approaches rely on computationally expensive draft generation, which introduces substantial prefilling overhead and limits their practicality in real-world deployment. To address this challenge, we propose LookaheadKV, a lightweight eviction framework that leverages the strength of a surrogate future response without requiring explicit draft generation. LookaheadKV augments transformer layers with parameter-efficient modules trained to predict true importance scores with high accuracy. Our design incurs negligible runtime overhead, comparable to existing inexpensive heuristics, while achieving accuracy superior to that of more costly approximation methods. Extensive experiments on long-context understanding benchmarks, across a wide range of models, demonstrate that our method not only outperforms recent competitive baselines on various long-context understanding tasks, but also reduces eviction cost by up to 14.5x, leading to significantly faster time-to-first-token. Our code is available at https://github.com/SamsungLabs/LookaheadKV.
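To make the score-then-evict mechanism described above concrete, the minimal sketch below shows the generic pipeline: estimate a per-position importance score for the prompt KV cache, then retain only the top-scoring entries within a fixed budget. The window-attention heuristic, function names, and tensor shapes are illustrative assumptions standing in for a typical inexpensive baseline scorer; they are not LookaheadKV's learned prediction modules.

import torch

def window_attention_scores(attn, window=32):
    # attn: [num_heads, q_len, kv_len] attention weights from prefill.
    # Score each KV position by the attention mass it receives from the
    # last `window` query positions, summed over heads. This is a common
    # inexpensive heuristic, used here as an assumed stand-in for the
    # learned importance predictor described in the abstract.
    return attn[:, -window:, :].sum(dim=(0, 1))

def evict_kv_cache(keys, values, scores, budget):
    # keys, values: [num_heads, kv_len, head_dim]; scores: [kv_len].
    # Keep only the `budget` highest-scoring prompt KV entries,
    # preserving their original positional order.
    budget = min(budget, scores.shape[0])
    keep, _ = torch.topk(scores, budget).indices.sort()
    return keys[:, keep, :], values[:, keep, :]

# Example with dummy shapes: 8 heads, 512 prompt tokens, head dim 128.
keys = torch.randn(8, 512, 128)
values = torch.randn(8, 512, 128)
attn = torch.rand(8, 512, 512)
attn = attn / attn.sum(dim=-1, keepdim=True)  # row-normalize like softmax output
scores = window_attention_scores(attn)
keys, values = evict_kv_cache(keys, values, scores, budget=128)
print(keys.shape)  # torch.Size([8, 128, 128])

At this level of abstraction, the paper's contribution can be read as swapping the heuristic scorer for parameter-efficient modules trained to predict the importance scores a true future response would induce, at roughly the runtime cost of the heuristic.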