CONF-KV: Confidence-Aware KV-Cache Eviction with Mixed-Precision Storage for Long-Horizon Large Language Models

Abstract

In long-horizon LLM inference, the key-value (KV) cache becomes the dominant consumer of GPU memory, and per-token attention grows increasingly expensive. Many common eviction policies rely on static recency windows or historical attention, leaving unused a signal that is computed at every decoding step: the model's current uncertainty. We introduce CONF-KV, a KV-cache manager that converts the next-token distribution into a scalar confidence score and uses that score to set the per-step cache budget, retaining more context when the model is uncertain and pruning aggressively when it is confident. Within each budget, tokens are ranked by a composite of accumulated attention mass and recency, while a protected window of recent tokens preserves local coherence. We combine the policy with blockwise online-softmax attention, mixed FP16/INT8 storage, and a pyramidal per-layer budget variant. Across four model families and generation lengths up to 4K tokens, CONF-KV stays near the memory footprint of a fixed 512-token sliding window while remaining within 1.5 to 2.1 perplexity points of full KV. On Needle-in-a-Haystack with contexts up to 32K tokens, CONF-KV reaches 91.4% retrieval accuracy, compared with 53.8% for a sliding window and 80.6% for H2O. On 75 VisualWebArena tasks, it retains 95.3% of full-KV success while using 2.8 times less peak memory.
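The policy described above can be illustrated with a minimal Python sketch. The abstract does not specify the exact formulas, so everything concrete here is an assumption made for illustration: confidence is taken as one minus normalized entropy, the budget is a linear interpolation between hypothetical `min_budget` and `max_budget` bounds, and the composite ranking score is a weighted sum controlled by a hypothetical weight `alpha`. The function names are likewise illustrative and are not taken from the paper.

```python
import math


def confidence(probs):
    """Scalar confidence in [0, 1] from a next-token distribution.

    Assumed form: 1 - (entropy / log(vocab_size)).
    """
    n = len(probs)
    if n < 2:
        return 1.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return max(0.0, min(1.0, 1.0 - entropy / math.log(n)))


def step_budget(conf, min_budget, max_budget):
    """Map confidence to a cache budget (assumed linear schedule).

    Low confidence keeps more tokens; high confidence keeps fewer.
    """
    return int(round(max_budget - conf * (max_budget - min_budget)))


def select_tokens(attn_mass, budget, recent_window, alpha=0.7):
    """Return the sorted cache positions to keep.

    attn_mass[i] is the accumulated attention mass of cached token i.
    The last `recent_window` positions are always kept (protected window);
    remaining slots go to the highest composite scores, where the score is
    an assumed weighted sum of normalized attention mass and recency.
    """
    n = len(attn_mass)
    if budget >= n:
        return list(range(n))
    protected = list(range(max(0, n - recent_window), n))
    slots = max(0, budget - len(protected))
    total = sum(attn_mass) or 1.0  # avoid division by zero
    candidates = range(0, n - len(protected))
    scored = sorted(
        candidates,
        key=lambda i: alpha * attn_mass[i] / total + (1 - alpha) * (i + 1) / n,
        reverse=True,
    )
    return sorted(scored[:slots] + protected)


# Usage: an uncertain (uniform) distribution yields the largest budget.
uniform = [0.25, 0.25, 0.25, 0.25]
budget = step_budget(confidence(uniform), min_budget=4, max_budget=6)
kept = select_tokens([5.0, 0.1, 0.1, 3.0, 0.1, 0.1, 0.2, 0.2], budget, recent_window=2)
```

In this sketch the protected window is kept even if it exceeds the budget, which is one possible design choice; a real implementation would also apply the selection per layer and per head and would pair it with the storage and attention kernels mentioned in the abstract.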