CONF-KV: Confidence-Aware KV-Cache Eviction with Mixed-Precision Storage for Long-Horizon LLMs

Abstract

Long-horizon LLM inference turns the key-value (KV) cache into the dominant consumer of GPU memory and makes per-token attention increasingly expensive. Many common eviction policies rely on static recency windows or historical attention, leaving unused a signal that is computed at every decoding step: the model's current uncertainty. We introduce CONF-KV, a KV-cache manager that converts the next-token distribution into a scalar confidence score and uses it to set the per-step cache budget, retaining more context when the model is uncertain and pruning aggressively when it is confident. Within each budget, tokens are ranked by a composite of accumulated attention mass and recency, while a protected recent window preserves local coherence. We combine this policy with blockwise online-softmax attention, mixed FP16/INT8 storage, and a pyramidal per-layer budget variant. Across four model families and generation lengths up to 4K tokens, CONF-KV stays near the memory footprint of a fixed 512-token sliding window while remaining within 1.5 to 2.1 perplexity points of full KV. On Needle-in-a-Haystack tasks with up to 32K tokens, CONF-KV reaches 91.4% retrieval accuracy, compared with 53.8% for a sliding window and 80.6% for H2O. On 75 VisualWebArena tasks, it retains 95.3% of the full-KV success rate with 2.8× lower peak memory.
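
The budget-then-rank policy described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the confidence function, the mapping from confidence to budget, or the weighting of attention mass against recency, so everything below is an assumption made for illustration: confidence is taken as one minus normalized entropy, the budget is a linear interpolation, and the names `confidence`, `step_budget`, `select_tokens`, and `alpha` are hypothetical rather than taken from CONF-KV.

```python
import math


def confidence(probs):
    """Scalar confidence in [0, 1].

    Assumption: 1 minus the Shannon entropy normalized by log(vocab size).
    The abstract only says the distribution is mapped to a scalar.
    """
    n = len(probs)
    if n < 2:
        return 1.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return 1.0 - entropy / math.log(n)


def step_budget(conf, min_budget, max_budget):
    """Per-step cache budget.

    Assumption: linear interpolation, so low confidence keeps more tokens
    and high confidence prunes toward min_budget.
    """
    conf = min(1.0, max(0.0, conf))
    return round(max_budget - conf * (max_budget - min_budget))


def select_tokens(attn_mass, budget, recent_window, alpha=0.5):
    """Return the sorted indices of cached tokens to keep.

    attn_mass[i] is the accumulated attention mass of cached token i.
    The last `recent_window` tokens are always kept (protected window);
    older tokens compete on alpha * attention + (1 - alpha) * recency.
    """
    n = len(attn_mass)
    if budget >= n:
        return list(range(n))
    protected = list(range(max(0, n - recent_window), n))
    remaining = budget - len(protected)
    if remaining <= 0:
        # Assumption: the protected window is kept even if it fills the budget.
        return protected
    total = sum(attn_mass) or 1.0

    def score(i):
        recency = (i + 1) / n  # newer tokens score closer to 1
        return alpha * attn_mass[i] / total + (1.0 - alpha) * recency

    candidates = range(n - len(protected))
    ranked = sorted(candidates, key=score, reverse=True)
    return sorted(ranked[:remaining] + protected)
```

In this sketch, a uniform next-token distribution yields confidence 0 and therefore the maximum budget, while a one-hot distribution yields confidence 1 and the minimum budget. Setting `alpha=1.0` reduces the ranking to pure accumulated attention mass outside the protected window.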