MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool
June 25, 2024
Authors: Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan
cs.AI
Abstract
Large language model (LLM) serving has transformed from stateless to stateful
systems, utilizing techniques like context caching and disaggregated inference.
These optimizations extend the lifespan and domain of the KV cache,
necessitating a new architectural approach. We present MemServe, a unified
system that integrates both inter-request and intra-request optimizations.
MemServe introduces MemPool, an elastic memory pool managing distributed memory
and KV caches across serving instances. Using MemPool APIs, MemServe combines
context caching with disaggregated inference for the first time, supported by a
global scheduler that enhances cache reuse through a global prompt tree-based
locality-aware policy. Tests show that MemServe significantly improves job
completion time and time-to-first-token.
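
To make the "global prompt tree-based locality-aware policy" concrete, below is a minimal, hypothetical sketch of how a global scheduler might route requests to the instance holding the longest cached prompt prefix. All class and method names (`GlobalPromptTree`, `best_instance`, etc.) are illustrative assumptions, not the actual MemServe or MemPool API.

```python
# Hypothetical sketch: prompt-tree based locality-aware routing.
# Not the MemServe implementation; names and structure are assumptions.

class PromptTreeNode:
    def __init__(self):
        self.children = {}      # token -> PromptTreeNode
        self.instances = set()  # instances caching KV state for this prefix


class GlobalPromptTree:
    """Global scheduler view of which instance caches which prompt prefix."""

    def __init__(self):
        self.root = PromptTreeNode()

    def insert(self, tokens, instance_id):
        """Record that `instance_id` holds KV cache for this token prefix."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, PromptTreeNode())
            node.instances.add(instance_id)

    def best_instance(self, tokens):
        """Return (instance, matched_prefix_len) with the longest cached prefix."""
        node, best = self.root, (None, 0)
        for depth, tok in enumerate(tokens, start=1):
            node = node.children.get(tok)
            if node is None or not node.instances:
                break
            best = (next(iter(node.instances)), depth)
        return best


# Usage: route a new request to the instance that can reuse the most KV cache.
tree = GlobalPromptTree()
tree.insert([1, 2, 3, 4], "prefill-A")
instance, hit_len = tree.best_instance([1, 2, 3, 9])
print(instance, hit_len)  # -> prefill-A 3 (reuse KV for the first 3 tokens)
```

In this sketch, cache reuse is maximized by matching on the longest shared token prefix; the real system additionally has to coordinate the elastic memory pool (MemPool) across prefill and decode instances in the disaggregated setting.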