CONF-KV: Confidence-Aware KV Cache Eviction with Mixed-Precision Storage for Long-Horizon LLM
May 24, 2026
Authors: Yubo Li, Yidi Miao
cs.AI
Abstract
Long-horizon LLM inference turns the key-value (KV) cache into the dominant GPU memory consumer and makes per-token attention increasingly expensive. Many common eviction policies use static recency windows or historical attention, leaving unused a signal computed on every decoding step: the model's current uncertainty. We introduce CONF-KV, a KV-cache manager that converts the next-token distribution into a scalar confidence score and uses it to choose the per-step cache budget, retaining more context when the model is uncertain and pruning aggressively when it is confident. Within each budget, tokens are ranked by a composite of accumulated attention mass and recency, while a protected recent window preserves local coherence. We combine the policy with blockwise online-softmax attention, mixed FP16/INT8 storage, and a pyramidal per-layer budget variant. Across four model families and generated lengths up to 4K, CONF-KV stays near the footprint of a fixed 512-token sliding window while remaining within 1.5 to 2.1 perplexity points of full KV. On Needle-in-a-Haystack up to 32K tokens, CONF-KV reaches 91.4% retrieval accuracy versus 53.8% for sliding windows and 80.6% for H2O; on 75 VisualWebArena tasks it retains 95.3% of full-KV success at 2.8 times lower peak memory.
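The per-step policy described above (confidence score, then cache budget, then ranked retention with a protected recent window) can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so everything specific here is an assumption made for illustration: confidence is taken as one minus the normalized entropy of the next-token distribution, the budget is a linear interpolation between a hypothetical `min_budget` and `max_budget`, and the composite score is a weighted sum with a hypothetical mixing weight `alpha`. The function names are likewise illustrative and are not taken from the paper.

```python
import math


def confidence_from_logits(logits):
    """Map next-token logits to a confidence in [0, 1].

    Assumption: confidence = 1 - normalized Shannon entropy of softmax(logits).
    """
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    max_entropy = math.log(len(probs))
    return 1.0 if max_entropy == 0.0 else 1.0 - entropy / max_entropy


def budget_from_confidence(conf, min_budget, max_budget):
    """Assumed linear map: low confidence keeps more tokens, high keeps fewer."""
    return round(max_budget - conf * (max_budget - min_budget))


def select_tokens(attn_mass, budget, recent_window, alpha=0.7):
    """Return the sorted cache positions to keep.

    attn_mass[i] is the attention mass accumulated by cached position i.
    The last `recent_window` positions are always kept; remaining slots go to
    older positions ranked by an assumed composite score:
        alpha * normalized attention mass + (1 - alpha) * normalized recency.
    """
    n = len(attn_mass)
    first_recent = max(0, n - recent_window)
    protected = list(range(first_recent, n))
    slots = budget - len(protected)
    if slots <= 0:
        # The protected window alone already fills (or exceeds) the budget.
        return protected
    total = sum(attn_mass) or 1.0  # avoid dividing by zero on an all-zero cache

    def score(i):
        return alpha * attn_mass[i] / total + (1.0 - alpha) * (i + 1) / n

    ranked = sorted(range(first_recent), key=score, reverse=True)
    return sorted(ranked[:slots]) + protected


# Usage: one decoding step over a toy 8-entry cache.
logits = [4.0, 0.5, 0.2, 0.1]  # toy next-token logits
conf = confidence_from_logits(logits)
budget = budget_from_confidence(conf, min_budget=4, max_budget=8)
keep = select_tokens([5.0, 0.1, 0.1, 3.0, 0.1, 0.1, 0.2, 0.2], budget, recent_window=2)
```

In this sketch the budget shrinks toward `min_budget` as the distribution sharpens, which mirrors the stated behavior of pruning aggressively when the model is confident. A real implementation would operate per layer and per head on tensors rather than Python lists.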