LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention

Abstract

Large language models (LLMs) have shown remarkable potential in processing long sequences, yet efficiently serving these long-context models remains challenging due to the quadratic computational complexity of attention in the prefilling stage and the large memory footprint of the KV cache in the decoding stage. To address these issues, we introduce LServe, an efficient system that accelerates long-sequence LLM serving via hybrid sparse attention. This method unifies different hardware-friendly, structured sparsity patterns for both prefilling and decoding attention into a single framework, where computations on less important tokens are skipped block-wise. LServe demonstrates the compatibility of static and dynamic sparsity in long-context LLM attention. This design enables multiplicative speedups by combining these optimizations. Specifically, we convert half of the attention heads to nearly free streaming heads in both the prefilling and decoding stages. Additionally, we find that only a constant number of KV pages is required to preserve long-context capabilities, irrespective of context length. We then design a hierarchical KV page selection policy that dynamically prunes KV pages based on query-centric similarity. On average, LServe accelerates LLM prefilling by up to 2.9x and decoding by 1.3-2.1x over vLLM, while maintaining long-context accuracy. Code is released at https://github.com/mit-han-lab/omniserve.
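
The idea of keeping only a constant number of KV pages, chosen by query-centric similarity, can be illustrated with a small sketch. The abstract does not specify how pages are scored, so the code below makes assumptions: each page is summarized by the per-channel minimum and maximum of its key vectors, and a page's score is an upper bound on the dot product between the query and any key in that page. The function names (summarize_page, page_score, select_pages) and the budget parameter are hypothetical and are not taken from the LServe code base. The sketch is flat and omits the hierarchical part of the policy.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def summarize_page(keys: Sequence[Vector]) -> Tuple[List[float], List[float]]:
    """Per-channel (min, max) over the key vectors stored in one KV page.

    Assumption: a min/max summary is used here for illustration only.
    """
    k_min = [min(channel) for channel in zip(*keys)]
    k_max = [max(channel) for channel in zip(*keys)]
    return k_min, k_max


def page_score(query: Vector, k_min: Vector, k_max: Vector) -> float:
    """Upper bound on q . k for any key k whose channels lie in [k_min, k_max]."""
    return sum(max(q * lo, q * hi) for q, lo, hi in zip(query, k_min, k_max))


def select_pages(query: Vector, pages: Sequence[Sequence[Vector]], budget: int) -> List[int]:
    """Return the indices of at most `budget` highest-scoring pages, in page order.

    The budget is a constant, so the attention cost per decoding step does not
    grow with the number of pages (that is, with the context length).
    """
    scores = [page_score(query, *summarize_page(page)) for page in pages]
    ranked = sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Toy example: four pages, each holding two 2-dimensional keys.
pages = [
    [[1.0, 0.0], [0.5, 0.1]],    # page 0, score 1.1 for the query below
    [[-1.0, 0.0], [-0.5, 0.2]],  # page 1, score -0.3
    [[0.0, 2.0], [0.1, 1.5]],    # page 2, score 2.1
    [[0.9, 0.9], [1.0, 1.0]],    # page 3, score 2.0
]
query = [1.0, 1.0]

print(select_pages(query, pages, budget=2))  # [2, 3]
```

In a real serving system the page summaries would be computed once when a page is filled and reused for every decoding step, so only the scoring and top-k selection run per query.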
