MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool
June 25, 2024
Authors: Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan
cs.AI
Abstract
Large language model (LLM) serving has transformed from stateless to stateful
systems, utilizing techniques like context caching and disaggregated inference.
These optimizations extend the lifespan and domain of the KV cache,
necessitating a new architectural approach. We present MemServe, a unified
system that integrates both inter-request and intra-request optimizations.
MemServe introduces MemPool, an elastic memory pool managing distributed memory
and KV caches across serving instances. Using MemPool APIs, MemServe combines
context caching with disaggregated inference for the first time, supported by a
global scheduler that enhances cache reuse through a global prompt tree-based
locality-aware policy. Tests show that MemServe significantly improves job
completion time and time-to-first-token.
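
To make the "global prompt tree-based locality-aware policy" concrete, below is a minimal, hypothetical sketch of how a global scheduler might route requests to the instance holding the longest cached prompt prefix. All class and method names (`GlobalPromptTree`, `best_instance`, etc.) are illustrative assumptions, not the actual MemServe or MemPool API.

```python
# Hypothetical sketch: prompt-tree based locality-aware routing.
# Not the MemServe implementation; names and structure are assumptions.

class PromptTreeNode:
    def __init__(self):
        self.children = {}      # token -> PromptTreeNode
        self.instances = set()  # instances caching KV state for this prefix


class GlobalPromptTree:
    """Global scheduler view of which instance caches which prompt prefix."""

    def __init__(self):
        self.root = PromptTreeNode()

    def insert(self, tokens, instance_id):
        """Record that `instance_id` holds KV cache for this token prefix."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, PromptTreeNode())
            node.instances.add(instance_id)

    def best_instance(self, tokens):
        """Return (instance, matched_prefix_len) with the longest cached prefix."""
        node, best = self.root, (None, 0)
        for depth, tok in enumerate(tokens, start=1):
            node = node.children.get(tok)
            if node is None or not node.instances:
                break
            best = (next(iter(node.instances)), depth)
        return best


# Usage: route a new request to the instance that can reuse the most KV cache.
tree = GlobalPromptTree()
tree.insert([1, 2, 3, 4], "prefill-A")
instance, hit_len = tree.best_instance([1, 2, 3, 9])
print(instance, hit_len)  # -> prefill-A 3 (reuse KV for the first 3 tokens)
```

In this sketch, cache reuse is maximized by matching on the longest shared token prefix; the real system additionally has to coordinate the elastic memory pool (MemPool) across prefill and decode instances in the disaggregated setting.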