RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference

Abstract

The growing context lengths of large language models (LLMs) pose significant challenges for efficient inference, primarily because of GPU memory and bandwidth constraints. We present RetroInfer, a novel system that reconceptualizes the key-value (KV) cache as a vector storage system and exploits the inherent sparsity of attention to accelerate long-context LLM inference. At its core is the wave index, an Attention-aWare VEctor index that retrieves critical tokens efficiently and accurately through techniques such as tripartite attention approximation, accuracy-bounded attention estimation, and segmented clustering. Complementing it is the wave buffer, which coordinates KV cache placement and overlaps computation with data transfer across GPU and CPU to sustain high throughput. Unlike prior sparsity-based methods, which struggle with token selection and hardware coordination, RetroInfer delivers robust performance without compromising model accuracy. Experiments on long-context benchmarks show up to 4.5x speedup over full attention within GPU memory limits and up to 10.5x speedup over sparse attention baselines when the KV cache is extended to CPU memory, all while preserving full-attention-level accuracy.
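To make the notion of attention sparsity concrete, the following is a minimal sketch, not RetroInfer's wave index. It assumes a toy setting in which three of 64 keys align with the query, and it compares full softmax attention against attention restricted to the top-k keys by score. The function names (`full_attention`, `topk_attention`), the data, and the choice of k are illustrative assumptions. The exhaustive scoring used here to find the top-k keys is exactly the cost a vector index is meant to avoid.

```python
import math


def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values, idx):
    # Scaled dot-product attention restricted to the key/value rows in idx.
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, keys[i])) for i in idx]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, idx)) for d in range(dim)]


def full_attention(q, keys, values):
    return attend(q, keys, values, list(range(len(keys))))


def topk_attention(q, keys, values, k):
    # Brute-force selection of the k highest-scoring keys (illustration only;
    # a real system would use an index instead of scoring every key).
    scores = [sum(a * b for a, b in zip(q, key)) for key in keys]
    idx = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    return attend(q, keys, values, idx)


# Hypothetical toy data: 64 tokens, 3 of which ("hot") align with the query.
n = 64
hot = {5, 20, 41}
q = [4.0, 0.0, 0.0, 0.0]
keys = [[5.0, 0.0, 0.0, 0.0] if i in hot else [0.0] * 4 for i in range(n)]
values = [[math.sin(i), math.cos(i)] for i in range(n)]

dense = full_attention(q, keys, values)
sparse = topk_attention(q, keys, values, k=3)
max_err = max(abs(a - b) for a, b in zip(dense, sparse))
print("max abs error with 3 of 64 tokens:", max_err)
```

In this constructed example the three selected tokens carry more than 99.9% of the softmax mass, so the sparse output stays close to the dense one. Real attention distributions are less cleanly separated, which is why accurate token selection matters.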
