LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention

February 20, 2025
Authors: Shang Yang, Junxian Guo, Haotian Tang, Qinghao Hu, Guangxuan Xiao, Jiaming Tang, Yujun Lin, Zhijian Liu, Yao Lu, Song Han
cs.AI

Abstract

Large language models (LLMs) have shown remarkable potential in processing long sequences, yet efficiently serving these long-context models remains challenging due to the quadratic computational complexity of attention in the prefilling stage and the large memory footprint of the KV cache in the decoding stage. To address these issues, we introduce LServe, an efficient system that accelerates long-sequence LLM serving via hybrid sparse attention. This method unifies different hardware-friendly, structured sparsity patterns for both prefilling and decoding attention into a single framework, where computations on less important tokens are skipped block-wise. LServe demonstrates the compatibility of static and dynamic sparsity in long-context LLM attention. This design enables multiplicative speedups by combining these optimizations. Specifically, we convert half of the attention heads to nearly free streaming heads in both the prefilling and decoding stages. Additionally, we find that only a constant number of KV pages is required to preserve long-context capabilities, irrespective of context length. We then design a hierarchical KV page selection policy that dynamically prunes KV pages based on query-centric similarity. On average, LServe accelerates LLM prefilling by up to 2.9x and decoding by 1.3-2.1x over vLLM, maintaining long-context accuracy. Code is released at https://github.com/mit-han-lab/omniserve.
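
The query-centric KV page selection described in the abstract can be illustrated with a minimal sketch. The snippet below is an assumption for clarity rather than the released implementation: the function name `select_kv_pages`, the tensor shapes, and the use of per-page min/max key summaries as the page-level importance estimate are choices made here for illustration; the actual system additionally handles streaming heads, batching, and hierarchical paging.

```python
# Minimal sketch of query-centric KV page selection (illustrative, not the
# authors' code). Each page is summarized by per-channel key bounds; pages
# whose upper-bound contribution to q·k is largest are kept, and attention
# is then computed only over this constant-size set of pages.
import torch

def select_kv_pages(query: torch.Tensor, key_pages: torch.Tensor, num_selected: int) -> torch.Tensor:
    """Return indices of the KV pages most relevant to the current query.

    query:        [head_dim] decoding-step query for one attention head
    key_pages:    [num_pages, page_size, head_dim] paged key cache for that head
    num_selected: constant page budget, independent of total context length
    """
    # Per-page, per-channel summaries of the stored keys.
    k_max = key_pages.max(dim=1).values   # [num_pages, head_dim]
    k_min = key_pages.min(dim=1).values   # [num_pages, head_dim]

    # Upper-bound each page's possible query-key dot product: for every channel,
    # take whichever bound maximizes q_i * k_i, then sum over channels.
    scores = torch.maximum(query * k_max, query * k_min).sum(dim=-1)  # [num_pages]

    # Keep a constant number of pages regardless of how long the context is.
    return torch.topk(scores, k=min(num_selected, scores.numel())).indices

# Usage sketch: run dense attention over the selected pages only.
# q = torch.randn(128)
# pages = torch.randn(64, 16, 128)   # 64 pages of 16 tokens each
# kept = select_kv_pages(q, pages, num_selected=8)
```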
