CONF-KV: Confidence-Aware KV Cache Eviction with Mixed-Precision Storage for Long-Horizon LLM

May 24, 2026
Authors: Yubo Li, Yidi Miao
cs.AI

Abstract

Long-horizon LLM inference turns the key-value (KV) cache into the dominant GPU memory consumer and makes per-token attention increasingly expensive. Many common eviction policies use static recency windows or historical attention, leaving unused a signal computed on every decoding step: the model's current uncertainty. We introduce CONF-KV, a KV-cache manager that converts the next-token distribution into a scalar confidence score and uses it to choose the per-step cache budget, retaining more context when the model is uncertain and pruning aggressively when it is confident. Within each budget, tokens are ranked by a composite of accumulated attention mass and recency, while a protected recent window preserves local coherence. We combine the policy with blockwise online-softmax attention, mixed FP16/INT8 storage, and a pyramidal per-layer budget variant. Across four model families and generated lengths up to 4K tokens, CONF-KV stays near the footprint of a fixed 512-token sliding window while remaining within 1.5 to 2.1 perplexity points of full KV. On Needle-in-a-Haystack up to 32K tokens, CONF-KV reaches 91.4% retrieval accuracy versus 53.8% for sliding windows and 80.6% for H2O; on 75 VisualWebArena tasks it retains 95.3% of full-KV success at 2.8x lower peak memory.
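
The eviction policy described in the abstract can be illustrated with a minimal sketch for a single layer and head. The abstract does not give the exact formulas, so everything specific here is an assumption: confidence is taken as one minus the normalized entropy of the next-token distribution, the budget is a linear interpolation between a hypothetical `min_budget` and `max_budget`, and the composite score mixes normalized attention mass and position with a hypothetical weight `alpha`. All function and parameter names are illustrative and are not taken from the paper.

```python
import math


def confidence_from_logits(logits):
    """Assumed mapping: confidence = 1 - normalized entropy, in [0, 1]."""
    m = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    max_entropy = math.log(len(probs))
    if max_entropy == 0.0:  # single-token vocabulary: trivially confident
        return 1.0
    return 1.0 - entropy / max_entropy


def budget_from_confidence(conf, min_budget, max_budget):
    """Assumed linear schedule: uncertain -> max_budget, confident -> min_budget."""
    return round(max_budget - conf * (max_budget - min_budget))


def select_tokens(attn_mass, budget, recent_window, alpha=0.5):
    """Return the sorted cache positions to keep.

    attn_mass[i] is the attention mass accumulated by position i so far.
    The last `recent_window` positions are always kept; the remaining
    slots go to the highest-scoring older positions.
    """
    n = len(attn_mass)
    if n <= budget:
        return list(range(n))  # nothing to evict yet
    protected = list(range(max(0, n - recent_window), n))
    slots = max(0, budget - len(protected))
    total = sum(attn_mass) or 1.0  # avoid division by zero

    def score(i):
        # Composite of normalized attention mass and recency (assumed form).
        return alpha * attn_mass[i] / total + (1.0 - alpha) * (i + 1) / n

    older = range(n - len(protected))
    ranked = sorted(older, key=score, reverse=True)
    return sorted(ranked[:slots] + protected)


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.5, 0.1]
    conf = confidence_from_logits(logits)
    budget = budget_from_confidence(conf, min_budget=4, max_budget=8)
    mass = [0.9, 0.05, 0.6, 0.1, 0.2, 0.3, 0.4, 0.5, 0.1, 0.2]
    print(conf, budget, select_tokens(mass, budget, recent_window=2))
```

In this sketch a flat next-token distribution pushes the budget toward `max_budget`, and a sharply peaked one pushes it toward `min_budget`, which mirrors the "retain more when uncertain, prune when confident" behavior the abstract describes. The blockwise attention, mixed FP16/INT8 storage, and pyramidal per-layer budgets are left out.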