KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving

Abstract

Large language models (LLMs) are widely adopted in production, pushing inference systems to their limits. Disaggregated LLM serving (e.g., PD separation and KV state disaggregation) improves scalability and cost efficiency, but it also turns KV into an explicit payload that crosses network and storage boundaries, making KV a dominant end-to-end bottleneck. Existing KV compression schemes are typically static runtime configurations, even though the production service context varies over time in workload mix, bandwidth, and SLO/quality budgets. As a result, a fixed choice can be suboptimal or even increase latency. We present KVServe, the first service-aware, adaptive KV communication compression framework for disaggregated LLM serving. KVServe (1) unifies KV compression into a modular strategy space with new components and cross-method recomposition; (2) introduces a Bayesian Profiling Engine that efficiently searches this space and distills a 3D Pareto candidate set, reducing offline search overhead by 50×; and (3) deploys a Service-Aware Online Controller that combines an analytical latency model with a lightweight bandit to select profiles under constraints and correct offline-to-online mismatch. Integrated into vLLM and evaluated across datasets, models, GPUs, and networks, KVServe achieves up to 9.13× job completion time (JCT) speedup in PD-separated serving and up to 32.8× time-to-first-token (TTFT) reduction in KV-disaggregated serving.
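To make concrete the claim that a fixed compression choice can be suboptimal or even increase latency, the following is a minimal sketch of constraint-aware profile selection driven by a simple analytical latency model. It is not KVServe's actual model or controller: the `Profile` fields, the latency formula (codec time plus wire time), and every number are hypothetical assumptions chosen for illustration, and the bandit-based online correction and Bayesian profiling are omitted entirely.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Hypothetical compression profile; all fields are illustrative assumptions."""
    name: str
    ratio: float         # original size / compressed size
    codec_gbs: float     # combined compress+decompress throughput, GB/s of original data
    quality_loss: float  # assumed quality degradation (lower is better)


def transfer_latency(p: Profile, kv_gb: float, bandwidth_gbs: float) -> float:
    """Assumed model: seconds spent in the codec plus seconds on the wire."""
    codec_time = kv_gb / p.codec_gbs              # 0.0 when codec_gbs is inf
    wire_time = (kv_gb / p.ratio) / bandwidth_gbs
    return codec_time + wire_time


def select_profile(profiles, kv_gb, bandwidth_gbs, max_quality_loss):
    """Pick the lowest-latency profile among those within the quality budget."""
    feasible = [p for p in profiles if p.quality_loss <= max_quality_loss]
    return min(feasible, key=lambda p: transfer_latency(p, kv_gb, bandwidth_gbs))


# Made-up candidate set: no compression, a light codec, and a heavy codec.
PROFILES = [
    Profile("none", ratio=1.0, codec_gbs=math.inf, quality_loss=0.0),
    Profile("light", ratio=2.0, codec_gbs=20.0, quality_loss=0.005),
    Profile("heavy", ratio=8.0, codec_gbs=4.0, quality_loss=0.02),
]

if __name__ == "__main__":
    for bw in (1.0, 50.0):  # slow link vs. fast link, in GB/s
        best = select_profile(PROFILES, kv_gb=1.0, bandwidth_gbs=bw, max_quality_loss=0.05)
        print(f"bandwidth={bw} GB/s -> {best.name}")
```

Under these assumed numbers, the heavy codec is the best choice on the slow link, but on the fast link its codec time exceeds the wire time it saves, so sending uncompressed KV is faster. Tightening the quality budget changes the answer again, which is the kind of context dependence that motivates selecting profiles online rather than fixing one at deployment.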