Efficient Memory Management for Large Language Model Serving with PagedAttention
September 12, 2023
Authors: Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica
cs.AI
Abstract
High throughput serving of large language models (LLMs) requires batching
sufficiently many requests at a time. However, existing systems struggle
because the key-value cache (KV cache) memory for each request is huge and
grows and shrinks dynamically. When managed inefficiently, this memory can be
significantly wasted by fragmentation and redundant duplication, limiting the
batch size. To address this problem, we propose PagedAttention, an attention
algorithm inspired by the classical virtual memory and paging techniques in
operating systems. On top of it, we build vLLM, an LLM serving system that
achieves (1) near-zero waste in KV cache memory and (2) flexible sharing of KV
cache within and across requests to further reduce memory usage. Our
evaluations show that vLLM improves the throughput of popular LLMs by
2-4× with the same level of latency compared to the state-of-the-art
systems, such as FasterTransformer and Orca. The improvement is more pronounced
with longer sequences, larger models, and more complex decoding algorithms.
vLLM's source code is publicly available at
https://github.com/vllm-project/vllm
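
To make the paging analogy concrete, the following is a minimal, illustrative sketch in Python, not vLLM's actual implementation: the KV cache is split into fixed-size blocks, each sequence keeps a block table mapping its logical blocks to physical ones, blocks are allocated on demand as tokens are generated, and reference counting lets sequences share blocks (e.g. for parallel sampling). All names here (BLOCK_SIZE, BlockManager, append_token, fork) are hypothetical.

# Toy block manager mirroring the paging idea behind PagedAttention.
# Illustrative only; not code from the vLLM repository.

BLOCK_SIZE = 16  # tokens stored per KV-cache block (assumed value)

class BlockManager:
    def __init__(self, num_physical_blocks):
        self.free_blocks = list(range(num_physical_blocks))
        self.ref_counts = [0] * num_physical_blocks
        self.block_tables = {}  # sequence id -> list of physical block ids

    def allocate(self, seq_id, num_tokens):
        """Allocate just enough blocks for the prompt; no contiguous
        reservation for the maximum sequence length is required."""
        num_blocks = -(-num_tokens // BLOCK_SIZE)  # ceiling division
        self.block_tables[seq_id] = [self._get_free_block()
                                     for _ in range(num_blocks)]

    def append_token(self, seq_id, num_tokens_so_far):
        """Grow the sequence by one token; a new block is added only when
        the last block is full, so waste is at most one block per sequence."""
        if num_tokens_so_far % BLOCK_SIZE == 0:
            self.block_tables[seq_id].append(self._get_free_block())

    def fork(self, parent_id, child_id):
        """Share all of the parent's blocks with a child sequence; a
        copy-on-write step would precede any write to a shared block."""
        table = self.block_tables[parent_id]
        for block in table:
            self.ref_counts[block] += 1
        self.block_tables[child_id] = list(table)

    def free(self, seq_id):
        for block in self.block_tables.pop(seq_id):
            self.ref_counts[block] -= 1
            if self.ref_counts[block] == 0:
                self.free_blocks.append(block)

    def _get_free_block(self):
        block = self.free_blocks.pop()
        self.ref_counts[block] = 1
        return block


if __name__ == "__main__":
    mgr = BlockManager(num_physical_blocks=64)
    mgr.allocate(seq_id=0, num_tokens=20)              # 20-token prompt -> 2 blocks
    mgr.append_token(seq_id=0, num_tokens_so_far=32)   # 33rd token -> 3rd block
    mgr.fork(parent_id=0, child_id=1)                  # share KV blocks across samples
    print(mgr.block_tables)

Because each sequence wastes at most one partially filled block and blocks need not be contiguous, an allocator of this kind avoids the pre-reservation and fragmentation costs that limit batch size under contiguous KV-cache layouts, which is the memory behavior the abstract describes.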