Efficient Memory Management for Large Language Model Serving with PagedAttention
September 12, 2023
Authors: Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica
cs.AI
Abstract
High throughput serving of large language models (LLMs) requires batching
sufficiently many requests at a time. However, existing systems struggle
because the key-value cache (KV cache) memory for each request is huge and
grows and shrinks dynamically. When managed inefficiently, this memory can be
significantly wasted by fragmentation and redundant duplication, limiting the
batch size. To address this problem, we propose PagedAttention, an attention
algorithm inspired by the classical virtual memory and paging techniques in
operating systems. On top of it, we build vLLM, an LLM serving system that
achieves (1) near-zero waste in KV cache memory and (2) flexible sharing of KV
cache within and across requests to further reduce memory usage. Our
evaluations show that vLLM improves the throughput of popular LLMs by
2-4× with the same level of latency compared to the state-of-the-art
systems, such as FasterTransformer and Orca. The improvement is more pronounced
with longer sequences, larger models, and more complex decoding algorithms.
vLLM's source code is publicly available at
https://github.com/vllm-project/vllm
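
To make the paging analogy concrete, the following is a minimal, illustrative sketch in Python, not vLLM's actual implementation: the KV cache is split into fixed-size blocks, each sequence keeps a block table mapping its logical blocks to physical ones, blocks are allocated on demand as tokens are generated, and reference counting lets sequences share blocks (e.g. for parallel sampling). All names here (BLOCK_SIZE, BlockManager, append_token, fork) are hypothetical.

# Toy block manager mirroring the paging idea behind PagedAttention.
# Illustrative only; not code from the vLLM repository.

BLOCK_SIZE = 16  # tokens stored per KV-cache block (assumed value)

class BlockManager:
    def __init__(self, num_physical_blocks):
        self.free_blocks = list(range(num_physical_blocks))
        self.ref_counts = [0] * num_physical_blocks
        self.block_tables = {}  # sequence id -> list of physical block ids

    def allocate(self, seq_id, num_tokens):
        """Allocate just enough blocks for the prompt; no contiguous
        reservation for the maximum sequence length is required."""
        num_blocks = -(-num_tokens // BLOCK_SIZE)  # ceiling division
        self.block_tables[seq_id] = [self._get_free_block()
                                     for _ in range(num_blocks)]

    def append_token(self, seq_id, num_tokens_so_far):
        """Grow the sequence by one token; a new block is added only when
        the last block is full, so waste is at most one block per sequence."""
        if num_tokens_so_far % BLOCK_SIZE == 0:
            self.block_tables[seq_id].append(self._get_free_block())

    def fork(self, parent_id, child_id):
        """Share all of the parent's blocks with a child sequence; a
        copy-on-write step would precede any write to a shared block."""
        table = self.block_tables[parent_id]
        for block in table:
            self.ref_counts[block] += 1
        self.block_tables[child_id] = list(table)

    def free(self, seq_id):
        for block in self.block_tables.pop(seq_id):
            self.ref_counts[block] -= 1
            if self.ref_counts[block] == 0:
                self.free_blocks.append(block)

    def _get_free_block(self):
        block = self.free_blocks.pop()
        self.ref_counts[block] = 1
        return block


if __name__ == "__main__":
    mgr = BlockManager(num_physical_blocks=64)
    mgr.allocate(seq_id=0, num_tokens=20)              # 20-token prompt -> 2 blocks
    mgr.append_token(seq_id=0, num_tokens_so_far=32)   # 33rd token -> 3rd block
    mgr.fork(parent_id=0, child_id=1)                  # share KV blocks across samples
    print(mgr.block_tables)

Because each sequence wastes at most one partially filled block and blocks need not be contiguous, an allocator of this kind avoids the pre-reservation and fragmentation costs that limit batch size under contiguous KV-cache layouts, which is the memory behavior the abstract describes.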