LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation

Abstract

Transformer-based large language models (LLMs) rely on key-value (KV) caching to avoid redundant computation during autoregressive inference. While this mechanism greatly improves efficiency, the cache size grows linearly with the input sequence length and quickly becomes a bottleneck for long-context tasks. Existing solutions mitigate this problem by evicting prompt KV entries that are deemed unimportant according to estimated importance scores. Notably, a recent line of work proposes to improve eviction quality by "glimpsing into the future": a draft generator produces a surrogate future response that approximates the target model's true response, and this surrogate is then used to estimate the importance of cached KV entries more accurately. However, these approaches rely on computationally expensive draft generation, which introduces substantial prefilling overhead and limits their practicality in real-world deployment. To address this challenge, we propose LookaheadKV, a lightweight eviction framework that leverages the strengths of surrogate future responses without requiring explicit draft generation. LookaheadKV augments transformer layers with parameter-efficient modules trained to predict true importance scores with high accuracy. Our design keeps runtime overhead negligible, comparable to existing inexpensive heuristics, while achieving accuracy superior to more costly approximation methods. Extensive experiments on long-context understanding benchmarks, across a wide range of models, demonstrate that our method not only outperforms recent competitive baselines on various long-context understanding tasks, but also reduces the eviction cost by up to 14.5x, leading to significantly faster time-to-first-token. Our code is available at https://github.com/SamsungLabs/LookaheadKV.
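To make the idea of score-based eviction concrete, the following is a minimal sketch in plain Python. It is not the LookaheadKV implementation and does not use the learned modules described above. It only illustrates the general pattern the abstract refers to: assign each cached prompt KV pair an importance score, then keep the highest-scoring pairs within a fixed budget. Here the score is assumed to be the average attention weight that a small set of query vectors places on each key. In draft-based methods those queries would come from the surrogate response. The function names (`attention_weights`, `importance_scores`, `evict`) and the toy data are hypothetical and chosen for illustration only.

```python
import math


def attention_weights(query, keys):
    """Softmax over scaled dot products between one query and all keys."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def importance_scores(queries, keys):
    """Average attention each cached key receives across the given queries.

    Assumption: mean attention weight is used as the importance proxy.
    """
    totals = [0.0] * len(keys)
    for query in queries:
        for i, weight in enumerate(attention_weights(query, keys)):
            totals[i] += weight
    return [t / len(queries) for t in totals]


def evict(keys, values, scores, budget):
    """Keep the `budget` highest-scoring KV pairs, preserving token order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:budget])
    return [keys[i] for i in kept], [values[i] for i in kept], kept


# Toy cache with four prompt tokens and two stand-in "future" queries.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
values = [[0.1], [0.2], [0.3], [0.4]]
queries = [[2.0, 0.0], [1.0, 1.0]]

scores = importance_scores(queries, keys)
kept_keys, kept_values, kept_idx = evict(keys, values, scores, budget=2)
print(kept_idx)  # indices of the prompt tokens that survive eviction
```

In this sketch the cost of obtaining `queries` is what separates the approaches the abstract contrasts: generating a draft response to produce them is expensive, whereas LookaheadKV, according to the abstract, trains lightweight modules to predict the importance scores without that generation step.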
