MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool

Abstract

Large language model (LLM) serving has evolved from stateless to stateful systems, using techniques such as context caching and disaggregated inference. These optimizations extend the lifespan and domain of the KV cache, necessitating a new architectural approach. We present MemServe, a unified system that integrates both inter-request and intra-request optimizations. MemServe introduces MemPool, an elastic memory pool that manages distributed memory and KV caches across serving instances. Using MemPool APIs, MemServe combines context caching with disaggregated inference for the first time, supported by a global scheduler that enhances cache reuse through a locality-aware policy based on a global prompt tree. Tests show that MemServe significantly improves job completion time and time-to-first-token.
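The abstract does not describe how the global prompt tree or the locality-aware policy is implemented. As a rough illustration only, the idea can be sketched as a token-level prefix tree that records which serving instance is assumed to hold cached KV state for each prompt prefix, with requests routed to the instance that has the longest cached prefix. All names here (GlobalPromptTree, record, match_lengths, route, the load map) are hypothetical and are not MemServe or MemPool APIs; breaking ties by load is likewise an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    # Child nodes keyed by token ID.
    children: dict = field(default_factory=dict)
    # Instances assumed to hold KV cache for the prefix ending at this node.
    instances: set = field(default_factory=set)


class GlobalPromptTree:
    """Hypothetical token-level prefix tree kept by a global scheduler."""

    def __init__(self):
        self.root = TrieNode()

    def record(self, tokens, instance):
        """Note that `instance` now caches every prefix of `tokens`."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())
            node.instances.add(instance)

    def match_lengths(self, tokens):
        """Return {instance: longest cached prefix length} for `tokens`."""
        best = {}
        node = self.root
        for depth, tok in enumerate(tokens, start=1):
            node = node.children.get(tok)
            if node is None:
                break
            for inst in node.instances:
                best[inst] = depth
        return best


def route(tree, tokens, load):
    """Pick the instance with the longest cached prefix; break ties by load.

    `load` maps every candidate instance to its current load (assumed metric).
    """
    matched = tree.match_lengths(tokens)
    # Instances with no cached prefix count as a match length of 0.
    return min(load, key=lambda inst: (-matched.get(inst, 0), load[inst], inst))


if __name__ == "__main__":
    tree = GlobalPromptTree()
    tree.record([1, 2, 3, 4], "inst-a")
    tree.record([1, 2, 9], "inst-b")
    load = {"inst-a": 5, "inst-b": 1}
    # inst-a shares a 3-token prefix, inst-b only 2, so inst-a is chosen.
    print(route(tree, [1, 2, 3, 7], load))
    # No cached prefix anywhere, so the least-loaded instance is chosen.
    print(route(tree, [8, 8], load))
```

A production scheduler would also need eviction, cache invalidation, and a policy for trading cache locality against load balance, none of which this sketch attempts.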
