KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction

Abstract

Transformer-based large language models (LLMs) cache context as key-value (KV) pairs during inference. As context length grows, the KV cache grows with it, leading to substantial memory overhead and increased attention latency. This paper introduces KVzip, a query-agnostic KV cache eviction method that enables effective reuse of compressed KV caches across diverse queries. KVzip quantifies the importance of each KV pair by using the underlying LLM to reconstruct the original context from the cached KV pairs, and then evicts the pairs with lower importance. Extensive empirical evaluations demonstrate that KVzip reduces KV cache size by 3-4× and FlashAttention decoding latency by approximately 2×, with negligible performance loss on question-answering, retrieval, reasoning, and code comprehension tasks. Evaluations cover various models, including LLaMA3.1-8B, Qwen2.5-14B, and Gemma3-12B, with context lengths of up to 170K tokens. KVzip significantly outperforms existing query-aware KV eviction methods, which suffer performance degradation even at a 90% cache budget ratio in multi-query scenarios.
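
The eviction step described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation. It assumes that a per-pair importance score has already been computed (the abstract says only that it comes from context reconstruction, so the scores below are made-up numbers), and the function name `evict_kv_pairs` and the parameter `keep_ratio` are hypothetical. Under these assumptions, a 3-4× reduction in cache size would correspond to a keep ratio of roughly 0.25 to 0.33.

```python
def evict_kv_pairs(scores, keep_ratio):
    """Return sorted indices of KV pairs to keep under a cache budget.

    scores: hypothetical per-pair importance values (higher = more important).
    keep_ratio: fraction of pairs to retain, in (0, 1].
    """
    if not 0 < keep_ratio <= 1:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, round(len(scores) * keep_ratio))
    # Rank positions by importance, highest first; ties keep the earlier position.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    # Restore positional order so the retained cache stays in sequence order.
    return sorted(ranked[:n_keep])


# Toy importance scores for 8 cached positions (illustrative values only).
scores = [0.90, 0.05, 0.40, 0.02, 0.75, 0.10, 0.03, 0.60]
kept = evict_kv_pairs(scores, keep_ratio=0.25)
print(kept)  # [0, 4]
```

Because the scores in this sketch do not depend on any particular query, the same retained index set could be reused for every later query over the same context, which is the property the abstract calls query-agnostic.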
