

MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool

June 25, 2024
Authors: Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan
cs.AI

Abstract

Large language model (LLM) serving has transformed from stateless to stateful systems, utilizing techniques like context caching and disaggregated inference. These optimizations extend the lifespan and domain of the KV cache, necessitating a new architectural approach. We present MemServe, a unified system that integrates both inter-request and intra-request optimizations. MemServe introduces MemPool, an elastic memory pool managing distributed memory and KV caches across serving instances. Using MemPool APIs, MemServe combines context caching with disaggregated inference for the first time, supported by a global scheduler that enhances cache reuse through a global prompt-tree-based, locality-aware policy. Tests show that MemServe significantly improves job completion time and time to first token (TTFT).
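The prompt-tree-based, locality-aware routing can be pictured with a small sketch. The following is a minimal illustration, not MemServe's actual API: a trie keyed by token IDs in which each node records which serving instances hold the KV cache for the prefix ending at that node, and the scheduler routes a new request to the instance with the longest cached prefix. All names (TrieNode, GlobalPromptTree, best_instance) are hypothetical.

```python
# Hypothetical sketch of a global prompt tree for locality-aware scheduling.
# Each trie node tracks which instances cache the KV state for its token prefix;
# requests are routed to the instance holding the longest matching prefix.

from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class TrieNode:
    children: Dict[int, "TrieNode"] = field(default_factory=dict)
    holders: Set[str] = field(default_factory=set)  # instances caching this prefix


class GlobalPromptTree:
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, tokens: List[int], instance: str) -> None:
        """Record that `instance` now caches the KV state for this token prefix."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())
            node.holders.add(instance)

    def best_instance(self, tokens: List[int]) -> Tuple[str, int]:
        """Return (instance, matched_prefix_length) with the longest reusable prefix."""
        node, best, depth = self.root, ("", 0), 0
        for tok in tokens:
            nxt = node.children.get(tok)
            if nxt is None or not nxt.holders:
                break
            depth += 1
            best = (next(iter(nxt.holders)), depth)  # pick any holder of this prefix
            node = nxt
        return best


# Usage: route a request to the instance with the most cached prefix tokens.
tree = GlobalPromptTree()
tree.insert([1, 2, 3, 4], instance="prefill-0")
instance, matched = tree.best_instance([1, 2, 3, 9])
print(instance, matched)  # -> prefill-0 3
```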

