KV Packet: Recomputation-Free Context-Independent KV Caching for LLMs

Abstract

Large Language Models (LLMs) rely heavily on Key-Value (KV) caching to minimize inference latency. However, standard KV caches are context-dependent: reusing a cached document in a new context requires recomputing its KV states to account for shifts in the attention distribution. Existing solutions such as CacheBlend, EPIC, and SAM-KV mitigate this issue by selectively recomputing a subset of tokens, but they still incur non-negligible computational overhead (FLOPs) and increased Time-to-First-Token (TTFT) latency. In this paper, we propose KV Packet, a recomputation-free cache reuse framework that treats cached documents as immutable "packets" wrapped in lightweight trainable soft-token adapters, which are trained via self-supervised distillation to bridge context discontinuities. Experiments on Llama-3.1 and Qwen2.5 demonstrate that KV Packet achieves near-zero FLOPs and lower TTFT than recomputation-based baselines, while retaining F1 scores comparable to those of the full recomputation baseline.
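
To make the context-dependence claim concrete, the following is a minimal sketch in NumPy, not the method proposed here. It assumes a toy two-layer, single-head causal attention model with random weights and no positional encoding; the names `causal_attention`, `layer2_keys`, `prefix`, and `doc` are hypothetical and chosen only for illustration. It compares the second-layer keys of a document computed in isolation against the keys of the same document computed after a preceding context.

```python
import numpy as np

def causal_attention(x, wq, wk, wv):
    """Toy single-head causal self-attention with a residual connection."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Mask out future positions so each token only attends to itself and earlier tokens.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return x + weights @ v

def layer2_keys(x, params):
    """Keys that a second layer would cache for the input embeddings x."""
    wq1, wk1, wv1, wk2 = params
    hidden = causal_attention(x, wq1, wk1, wv1)
    return hidden @ wk2

rng = np.random.default_rng(0)
d = 8
params = [rng.normal(size=(d, d)) for _ in range(4)]  # random toy weights (assumption)
prefix = rng.normal(size=(3, d))  # hypothetical preceding context
doc = rng.normal(size=(5, d))     # hypothetical document to be cached

# Keys cached when the document is encoded on its own.
keys_isolated = layer2_keys(doc, params)
# Keys for the same document tokens when a prefix precedes them.
keys_in_context = layer2_keys(np.vstack([prefix, doc]), params)[len(prefix):]

# Largest elementwise difference between the two sets of keys.
max_gap = float(np.abs(keys_isolated - keys_in_context).max())
print(f"max |isolated - in-context| key difference: {max_gap:.4f}")
```

In this toy setting the two sets of keys differ because the first layer lets document tokens attend to the prefix, which changes the hidden states that the second layer projects into keys. That mismatch is the kind of discontinuity that selective recomputation, or a recomputation-free scheme, has to account for.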
