You Need an Encoder for Native Position-Independent Caching

Abstract

The Key-Value (KV) cache of Large Language Models (LLMs) is prefix-based, making it highly inefficient for processing contexts retrieved in arbitrary order. Position-Independent Caching (PIC) has been proposed to enable KV reuse without positional constraints; however, existing approaches often incur substantial accuracy degradation, limiting their practical adoption. To address this issue, we propose native PIC by reintroducing the encoder to prevalent decoder-only LLMs and explicitly training it to support PIC. We further develop COMB, a PIC-aware caching system that integrates seamlessly with existing inference frameworks. Experimental results show that COMB reduces Time-to-First-Token (TTFT) by 51-94% and increases throughput by 3× while maintaining comparable accuracy. Furthermore, the quality improvement observed with DeepSeek-V2-Lite-Chat demonstrates the applicability of COMB to other types of decoder-only LLMs. Our code is available at https://github.com/shijuzhao/Comb.
