LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation

Abstract

Transformer-based large language models (LLMs) rely on key-value (KV) caching to avoid redundant computation during autoregressive inference. While this mechanism greatly improves efficiency, the cache size grows linearly with the input sequence length and quickly becomes a bottleneck for long-context tasks. Existing solutions mitigate this problem by evicting prompt KV entries that are deemed unimportant according to estimated importance scores. Notably, a recent line of work proposes to improve eviction quality by "glimpsing into the future": a draft generator produces a surrogate future response that approximates the target model's true response, and this surrogate is then used to estimate the importance of the cached KV entries more accurately. However, these approaches rely on computationally expensive draft generation, which introduces substantial prefilling overhead and limits their practicality in real-world deployment. To address this challenge, we propose LookaheadKV, a lightweight eviction framework that retains the benefits of a surrogate future response without requiring explicit draft generation. LookaheadKV augments the transformer layers with parameter-efficient modules trained to predict the true importance scores with high accuracy. Our design keeps the runtime overhead negligible, comparable to that of existing inexpensive heuristics, while achieving accuracy superior to more costly approximation methods. Extensive experiments on long-context understanding benchmarks across a wide range of models demonstrate that our method not only outperforms recent competitive baselines on a variety of tasks, but also reduces the eviction cost by up to 14.5x, leading to significantly faster time-to-first-token. Our code is available at https://github.com/SamsungLabs/LookaheadKV.
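
As a rough illustration of the score-guided eviction idea described above, the following sketch scores each cached prompt key by the average attention weight it receives from a set of "future" queries, then keeps only the highest-scoring entries. It is a toy, single-head, pure-Python example and not the LookaheadKV implementation (see the linked repository for that). The names `importance_scores`, `evict`, `budget`, and `future_queries` are hypothetical, and the queries here are hand-written stand-ins for whatever a draft response or a learned predictor would supply.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def importance_scores(queries, keys):
    """Average attention weight each cached key receives from the queries.

    Assumption: importance is defined as mean scaled dot-product attention
    over the given queries. Real methods may aggregate differently.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    scores = [0.0] * len(keys)
    for q in queries:
        logits = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        for i, w in enumerate(softmax(logits)):
            scores[i] += w / len(queries)
    return scores


def evict(keys, values, scores, budget):
    """Keep the `budget` highest-scoring KV pairs, preserving prompt order."""
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:budget])
    return [keys[i] for i in keep], [values[i] for i in keep], keep


# Toy prompt cache: four 2-dimensional keys with placeholder values.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
values = ["v0", "v1", "v2", "v3"]

# Hypothetical stand-ins for the queries of a (surrogate) future response.
future_queries = [[2.0, 2.0], [3.0, 0.0]]

scores = importance_scores(future_queries, keys)
kept_keys, kept_values, kept_idx = evict(keys, values, scores, budget=2)
print(kept_idx, kept_values)  # [0, 2] ['v0', 'v2']
```

In this toy setting the cost of obtaining `future_queries` is ignored. The abstract's point is that producing them through draft generation is expensive, which is the overhead LookaheadKV is designed to avoid by predicting the importance scores directly.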
