Efficient Memory Management for Large Language Model Serving with PagedAttention

Abstract

High-throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically. When managed inefficiently, this memory can be significantly wasted by fragmentation and redundant duplication, limiting the batch size. To address this problem, we propose PagedAttention, an attention algorithm inspired by the classical virtual memory and paging techniques in operating systems. On top of it, we build vLLM, an LLM serving system that achieves (1) near-zero waste in KV cache memory and (2) flexible sharing of KV cache within and across requests to further reduce memory usage. Our evaluations show that vLLM improves the throughput of popular LLMs by 2-4× at the same level of latency compared to state-of-the-art systems such as FasterTransformer and Orca. The improvement is more pronounced with longer sequences, larger models, and more complex decoding algorithms. vLLM's source code is publicly available at https://github.com/vllm-project/vllm.
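As a rough illustration of the paging analogy, the following is a minimal sketch and not vLLM's actual implementation. It assumes fixed-size KV blocks, a per-sequence block table that maps logical blocks to physical ones, and reference counts so that sequences can share blocks. The names `BlockAllocator`, `Sequence`, and `BLOCK_SIZE`, the block size of 4 tokens, and the copy-on-write step are all hypothetical choices made for this example.

```python
from dataclasses import dataclass, field
from typing import List

BLOCK_SIZE = 4  # tokens per KV block (assumed value, for illustration only)


class BlockAllocator:
    """Hypothetical pool of fixed-size physical KV blocks with refcounts."""

    def __init__(self, num_blocks: int) -> None:
        self.free: List[int] = list(range(num_blocks))
        self.refcount: List[int] = [0] * num_blocks

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> int:
        # A second sequence now refers to the same physical block.
        self.refcount[block] += 1
        return block

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            self.free.append(block)


@dataclass
class Sequence:
    """Logical view of one request: a block table plus a token count."""

    allocator: BlockAllocator
    block_table: List[int] = field(default_factory=list)
    num_tokens: int = 0

    def append_token(self) -> None:
        if self.num_tokens % BLOCK_SIZE == 0:
            # Last block is full (or no block yet): allocate on demand.
            self.block_table.append(self.allocator.allocate())
        elif self.allocator.refcount[self.block_table[-1]] > 1:
            # Last block is shared and partially filled: take a private
            # copy before writing (copy-on-write; the data copy itself
            # is omitted in this sketch).
            old = self.block_table[-1]
            self.block_table[-1] = self.allocator.allocate()
            self.allocator.release(old)
        self.num_tokens += 1

    def fork(self) -> "Sequence":
        # Share every existing block instead of duplicating it.
        shared = [self.allocator.share(b) for b in self.block_table]
        return Sequence(self.allocator, shared, self.num_tokens)

    def free(self) -> None:
        for block in self.block_table:
            self.allocator.release(block)
        self.block_table.clear()
        self.num_tokens = 0


# Usage: a 6-token prompt occupies 2 blocks; a forked sample shares the
# first block and gets a private copy of the partially filled second one.
alloc = BlockAllocator(num_blocks=8)
prompt = Sequence(alloc)
for _ in range(6):
    prompt.append_token()
sample = prompt.fork()
sample.append_token()
```

In this sketch, unused memory per sequence is bounded by the unfilled slots of its last block, and forked sequences pay only for the blocks they actually modify.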

