KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving

Abstract

Large language models (LLMs) are widely deployed in production, pushing inference systems to their limits. Disaggregated LLM serving (e.g., PD separation and KV state disaggregation) improves scalability and cost efficiency, but it also turns the KV cache into an explicit payload that crosses network and storage boundaries, making KV transfer a dominant end-to-end bottleneck. Existing KV compression schemes typically use static runtime configurations, even though the production service context varies over time in workload mix, bandwidth, and SLO/quality budgets. As a result, a fixed choice can be suboptimal or can even increase latency. We present KVServe, the first service-aware and adaptive KV communication compression framework for disaggregated LLM serving. KVServe (1) unifies KV compression into a modular strategy space with new components and cross-method recomposition; (2) introduces a Bayesian Profiling Engine that efficiently searches this space and distills a 3D Pareto candidate set, reducing offline search overhead by 50×; and (3) deploys a Service-Aware Online Controller that combines an analytical latency model with a lightweight bandit to select profiles under constraints and to correct offline-to-online mismatch. Integrated into vLLM and evaluated across multiple datasets, models, GPUs, and networks, KVServe achieves up to 9.13× JCT speedup in PD-separated serving and up to 32.8× TTFT reduction in KV-disaggregated serving.
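
To illustrate why a fixed compression setting can hurt, the following is a minimal sketch of an analytical latency model used to pick a profile under a quality budget. It is not KVServe's actual model or API: the `Profile` fields, the cost formula (wire time plus codec time), and all numbers are assumptions chosen for illustration, and the bandit-based correction described in the abstract is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Hypothetical compression profile; all fields are illustrative assumptions."""
    name: str
    ratio: float           # original size / compressed size (>= 1)
    codec_s_per_gb: float  # compress + decompress seconds per GB of original KV
    quality_loss: float    # estimated quality drop (0 means lossless)


def transfer_latency(kv_gb: float, bandwidth_gbps: float, p: Profile) -> float:
    """Assumed model: time on the wire for the compressed KV plus codec time."""
    wire_s = (kv_gb / p.ratio) * 8.0 / bandwidth_gbps  # GB -> Gb, then / Gbps
    codec_s = kv_gb * p.codec_s_per_gb
    return wire_s + codec_s


def select_profile(profiles, kv_gb, bandwidth_gbps, max_quality_loss):
    """Pick the lowest-latency profile whose quality loss fits the budget."""
    feasible = [p for p in profiles if p.quality_loss <= max_quality_loss]
    if not feasible:
        raise ValueError("no profile satisfies the quality budget")
    return min(feasible, key=lambda p: transfer_latency(kv_gb, bandwidth_gbps, p))


# Made-up candidate set; "none" means sending the KV cache uncompressed.
PROFILES = [
    Profile("none", ratio=1.0, codec_s_per_gb=0.0, quality_loss=0.0),
    Profile("light", ratio=2.0, codec_s_per_gb=0.05, quality_loss=0.01),
    Profile("heavy", ratio=8.0, codec_s_per_gb=0.4, quality_loss=0.03),
]

if __name__ == "__main__":
    for bw in (1.0, 10.0, 400.0):
        best = select_profile(PROFILES, kv_gb=2.0, bandwidth_gbps=bw,
                              max_quality_loss=0.05)
        print(f"{bw:>6} Gbps -> {best.name}")
```

Under these made-up numbers, the preferred profile shifts from heavy compression on a slow link to no compression on a fast one, where codec time outweighs the bytes saved. This is the kind of context dependence a static configuration cannot track.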