MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool

Abstract

Large language model (LLM) serving has evolved from stateless to stateful systems, using techniques such as context caching and disaggregated inference. These optimizations extend the lifespan and domain of the KV cache, which calls for a new architectural approach. We present MemServe, a unified system that integrates both inter-request and intra-request optimizations. MemServe introduces MemPool, an elastic memory pool that manages distributed memory and KV caches across serving instances. Using MemPool APIs, MemServe combines context caching with disaggregated inference for the first time, supported by a global scheduler that improves cache reuse through a locality-aware policy based on a global prompt tree. Tests show that MemServe significantly improves job completion time and time-to-first-token.
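To make the idea of a locality-aware policy based on a prompt tree more concrete, the following is a minimal sketch, not MemServe's actual implementation. It assumes that each serving instance's cached prompts can be summarized by a token-level prefix tree, and that the scheduler routes a request to the instance with the longest cached prefix, breaking ties by load. The names `PromptTree`, `GlobalScheduler`, `match_len`, and `route` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    # Maps a token id to the child node that extends the cached prefix.
    children: dict = field(default_factory=dict)


class PromptTree:
    """Hypothetical token-level prefix tree of prompts cached on one instance."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens):
        # Record every prefix of `tokens` as cached.
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())

    def match_len(self, tokens):
        # Length of the longest cached prefix of `tokens`.
        node, matched = self.root, 0
        for tok in tokens:
            node = node.children.get(tok)
            if node is None:
                break
            matched += 1
        return matched


class GlobalScheduler:
    """Hypothetical scheduler: longest cached prefix wins, ties go to lower load."""

    def __init__(self, instance_ids):
        self.trees = {i: PromptTree() for i in instance_ids}
        self.load = {i: 0 for i in instance_ids}

    def route(self, tokens):
        best = max(
            self.trees,
            key=lambda i: (self.trees[i].match_len(tokens), -self.load[i]),
        )
        # Assume the chosen instance caches the full prompt after serving it.
        self.trees[best].insert(tokens)
        self.load[best] += 1
        return best


sched = GlobalScheduler(["inst-0", "inst-1"])
system_prompt = [1, 2, 3, 4]
first = sched.route(system_prompt + [10, 11])  # no cache anywhere yet
second = sched.route(system_prompt + [20])     # shares the system prompt prefix
third = sched.route([7, 8, 9])                 # no shared prefix, falls back to load
```

In this sketch the second request follows the first to the same instance because it shares the system prompt prefix, while the unrelated third request goes to the less loaded instance. A production scheduler would also need to account for cache eviction, block-granular KV storage, and the split between prefill and decode instances, none of which are modeled here.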
