CONF-KV: Confidence-Aware KV Cache Eviction and Mixed-Precision Storage for Long-Horizon Large Language Models

Abstract

Long-horizon LLM inference makes the key-value (KV) cache the dominant consumer of GPU memory and makes per-token attention increasingly expensive. Many common eviction policies rely on static recency windows or historical attention, and leave unused a signal that is computed at every decoding step: the model's current uncertainty. We introduce CONF-KV, a KV-cache manager that converts the next-token distribution into a scalar confidence score and uses it to set the per-step cache budget, retaining more context when the model is uncertain and pruning aggressively when it is confident. Within each budget, tokens are ranked by a composite of accumulated attention mass and recency, while a protected recent window preserves local coherence. We combine this policy with blockwise online-softmax attention, mixed FP16/INT8 storage, and a pyramidal per-layer budget variant. Across four model families and generation lengths up to 4K tokens, CONF-KV stays near the memory footprint of a fixed 512-token sliding window while remaining within 1.5 to 2.1 perplexity points of full KV. On Needle-in-a-Haystack with contexts up to 32K tokens, CONF-KV reaches 91.4% retrieval accuracy, versus 53.8% for the sliding window and 80.6% for H2O; on 75 VisualWebArena tasks it retains 95.3% of the full-KV success rate while reducing peak memory by 2.8×.
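
As an illustration of the policy described above, the following minimal Python sketch maps a next-token distribution to a confidence score, turns that score into a cache budget, and selects which cached positions to keep. The abstract does not specify these details, so several choices here are assumptions made only for illustration: confidence is taken as one minus the normalized entropy, the budget is interpolated linearly between a minimum and a maximum, and the ranking score is a weighted sum with a hypothetical weight `alpha`. All function and parameter names are likewise hypothetical, not the authors' implementation.

```python
import math


def confidence_from_logits(logits):
    """Scalar confidence in [0, 1] from next-token logits.

    Assumption: confidence = 1 - (entropy / max entropy) of softmax(logits).
    """
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # shift by max for stability
    z = sum(exps)
    probs = [e / z for e in exps]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    max_entropy = math.log(len(probs))
    if max_entropy == 0.0:  # single-token vocabulary: trivially confident
        return 1.0
    return 1.0 - entropy / max_entropy


def budget_for_step(confidence, min_budget, max_budget):
    """Assumed linear schedule: uncertain -> max_budget, confident -> min_budget."""
    c = min(1.0, max(0.0, confidence))
    return round(max_budget - c * (max_budget - min_budget))


def select_tokens(attn_mass, budget, recent_window, alpha=0.5):
    """Return the sorted cache positions to keep.

    attn_mass[i] is the accumulated attention mass of cached position i.
    The last `recent_window` positions are always protected; the rest of the
    budget goes to older positions with the highest composite score
    alpha * normalized_mass + (1 - alpha) * recency (an assumed form).
    """
    n = len(attn_mass)
    if n <= budget:
        return list(range(n))  # everything fits, nothing to evict
    protected_start = max(0, n - recent_window)
    protected = list(range(protected_start, n))
    remaining = budget - len(protected)
    if remaining <= 0:
        # Budget is smaller than the protected window: keep the newest tokens.
        return protected[-budget:] if budget > 0 else []
    max_mass = max(attn_mass[:protected_start]) or 1.0
    scored = []
    for i in range(protected_start):
        recency = (i + 1) / n  # newer positions score closer to 1
        score = alpha * attn_mass[i] / max_mass + (1.0 - alpha) * recency
        scored.append((score, i))
    scored.sort(reverse=True)
    kept = sorted(i for _, i in scored[:remaining])
    return kept + protected


# Usage: one decoding step with toy values.
logits = [2.0, 0.5, 0.1, -1.0]
mass = [0.9, 0.1, 0.05, 0.8, 0.02, 0.3, 0.2, 0.1]
step_budget = budget_for_step(confidence_from_logits(logits), 4, 8)
keep = select_tokens(mass, step_budget, recent_window=2)
```

In this sketch a flat distribution yields a confidence near 0 and therefore the largest budget, while a sharply peaked distribution shrinks the budget toward its minimum. The protected window is filled first, so recent tokens survive regardless of their attention mass.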