
前瞻性键值缓存淘汰策略:无需生成即可预判未来的快速精准KV缓存清理技术

LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation

March 11, 2026
Authors: Jinwoo Ahn, Ingyu Seong, Akhil Kedia, Junhan Kim, Hyemi Jang, Kangwook Lee, Yongkweon Jeon
cs.AI

Abstract

Transformer-based large language models (LLMs) rely on key-value (KV) caching to avoid redundant computation during autoregressive inference. While this mechanism greatly improves efficiency, the cache size grows linearly with the input sequence length, quickly becoming a bottleneck for long-context tasks. Existing solutions mitigate this problem by evicting prompt KV pairs that are deemed unimportant, guided by estimated importance scores. Notably, a recent line of work proposes to improve eviction quality by "glimpsing into the future": a draft generator produces a surrogate future response approximating the target model's true response, and this surrogate is then used to estimate the importance of cached KV pairs more accurately. However, these approaches rely on computationally expensive draft generation, which introduces substantial prefilling overhead and limits their practicality in real-world deployment. To address this challenge, we propose LookaheadKV, a lightweight eviction framework that leverages the strengths of a surrogate future response without requiring explicit draft generation. LookaheadKV augments transformer layers with parameter-efficient modules trained to predict the true importance scores with high accuracy. Our design incurs negligible runtime overhead, comparable to existing inexpensive heuristics, while achieving accuracy superior to more costly approximation methods. Extensive experiments on long-context understanding benchmarks, across a wide range of models, demonstrate that our method not only outperforms recent competitive baselines on various long-context understanding tasks, but also reduces eviction cost by up to 14.5x, leading to significantly faster time-to-first-token. Our code is available at https://github.com/SamsungLabs/LookaheadKV.
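
To make the mechanism described above concrete, here is a minimal sketch of importance-score-based KV cache eviction in PyTorch. This is an illustration under assumptions, not the paper's implementation: the function name, the `predictor` module, and the `keep_ratio` parameter are all hypothetical. The `predictor` stands in for the parameter-efficient module the abstract describes, which scores prompt tokens in a single forward pass instead of generating a draft response.

```python
import torch

def evict_kv_by_predicted_importance(keys, values, hidden_states, predictor, keep_ratio=0.25):
    """Hypothetical sketch of score-guided KV cache eviction for one layer.

    keys, values:   [num_heads, seq_len, head_dim] cached prompt KV.
    hidden_states:  [seq_len, hidden_dim] per-token features from the layer.
    predictor:      a small trainable module (stand-in for the paper's
                    parameter-efficient module) mapping features to a
                    scalar importance score per token.
    keep_ratio:     fraction of prompt KV entries to retain.
    """
    seq_len = keys.shape[1]
    num_keep = max(1, int(seq_len * keep_ratio))

    # Predicted importance per prompt token. No draft generation is needed:
    # the scores come from one cheap forward pass of the lightweight module.
    scores = predictor(hidden_states).squeeze(-1)  # [seq_len]

    # Retain the highest-scoring tokens (sorted to preserve positional order);
    # all other KV entries are evicted from the cache.
    keep_idx = torch.topk(scores, num_keep).indices.sort().values

    return keys[:, keep_idx, :], values[:, keep_idx, :]
```

In this sketch, a stand-in predictor could be as simple as `torch.nn.Linear(hidden_dim, 1)`. The point of the paper's approach is that such a lightweight module, once trained, can approximate the future-aware importance scores that draft-based methods obtain only by generating a surrogate response, keeping eviction overhead close to that of inexpensive heuristics.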