Efficient Memory Management for Large Language Model Serving with PagedAttention

September 12, 2023
Authors: Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica
cs.AI

Abstract

High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically. When managed inefficiently, this memory can be significantly wasted by fragmentation and redundant duplication, limiting the batch size. To address this problem, we propose PagedAttention, an attention algorithm inspired by the classical virtual memory and paging techniques in operating systems. On top of it, we build vLLM, an LLM serving system that achieves (1) near-zero waste in KV cache memory and (2) flexible sharing of KV cache within and across requests to further reduce memory usage. Our evaluations show that vLLM improves the throughput of popular LLMs by 2-4× with the same level of latency compared to the state-of-the-art systems, such as FasterTransformer and Orca. The improvement is more pronounced with longer sequences, larger models, and more complex decoding algorithms. vLLM's source code is publicly available at https://github.com/vllm-project/vllm
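
As a rough illustration of the paging idea described in the abstract, the following Python sketch shows a block table that maps each sequence's logical KV-cache blocks to fixed-size physical blocks, allocated on demand and reference-counted so they can be shared across requests. This is a minimal sketch of the general technique, not the vLLM implementation; all names here (BlockAllocator, Sequence, block_size) are hypothetical.

```python
# Illustrative sketch of paged KV-cache bookkeeping (not vLLM's actual code).
# Physical blocks hold a fixed number of tokens' keys/values and are handed
# out on demand, so at most one partially filled block is wasted per sequence.

class BlockAllocator:
    """Hands out fixed-size physical KV-cache blocks and reference-counts
    them so a block can be shared across sequences (e.g. a common prefix)."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))  # ids of unused physical blocks
        self.refcount = {}                   # physical block id -> ref count

    def allocate(self) -> int:
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> int:
        # Another sequence now points at the same physical block.
        self.refcount[block] += 1
        return block

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.append(block)  # block is immediately reusable


class Sequence:
    """Tracks one request's mapping from logical to physical blocks."""

    def __init__(self, allocator: BlockAllocator, block_size: int = 16):
        self.allocator = allocator
        self.block_size = block_size
        self.block_table = []  # logical block i -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # Grab a new physical block only when the last one is full,
        # instead of reserving memory for the maximum possible length.
        if self.num_tokens % self.block_size == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1


# Usage: two sequences decode different numbers of tokens and claim memory
# block by block, so short sequences do not pin large preallocated buffers.
alloc = BlockAllocator(num_blocks=1024)
a, b = Sequence(alloc), Sequence(alloc)
for _ in range(40):
    a.append_token()
for _ in range(10):
    b.append_token()
print(len(a.block_table), len(b.block_table))  # 3 blocks vs. 1 block
```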