

Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference

August 27, 2025
Authors: Rongzhi Li, Ruogu Du, Zefang Chu, Sida Zhao, Chunlei Han, Zuocheng Shi, Yiwen Shao, Huanle Han, Long Huang, Zherui Liu, Shufan Liu
cs.AI

Abstract

Serving Large Language Models (LLMs) is a GPU-intensive task where traditional autoscalers fall short, particularly for modern Prefill-Decode (P/D) disaggregated architectures. This architectural shift, while powerful, introduces significant operational challenges, including inefficient use of heterogeneous hardware, network bottlenecks, and critical imbalances between prefill and decode stages. We introduce HeteroScale, a coordinated autoscaling framework that addresses the core challenges of P/D disaggregated serving. HeteroScale combines a topology-aware scheduler that adapts to heterogeneous hardware and network constraints with a novel metric-driven policy derived from the first large-scale empirical study of autoscaling signals in production. By leveraging a single, robust metric to jointly scale prefill and decode pools, HeteroScale maintains architectural balance while ensuring efficient, adaptive resource management. Deployed in a massive production environment on tens of thousands of GPUs, HeteroScale has proven its effectiveness, increasing average GPU utilization by a significant 26.6 percentage points and saving hundreds of thousands of GPU-hours daily, all while upholding stringent service level objectives.
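To make the coordinated-scaling idea concrete, here is a minimal Python sketch of how a single load signal might drive both pools while preserving the prefill/decode balance. All names, metrics, and thresholds (`coordinated_scale`, `pd_ratio`, decode tokens/sec as the signal) are illustrative assumptions, not HeteroScale's actual policy or interfaces.

```python
import math
from dataclasses import dataclass


@dataclass
class PoolState:
    """Replica counts for the two disaggregated serving stages."""
    prefill_replicas: int
    decode_replicas: int


def coordinated_scale(
    load_metric: float,          # e.g. service-wide decode tokens/sec (illustrative signal)
    capacity_per_decode: float,  # sustainable tokens/sec per decode replica (assumed known)
    pd_ratio: float = 0.5,       # target prefill:decode replica ratio (illustrative value)
    headroom: float = 0.8,       # run replicas at ~80% of capacity to absorb bursts
) -> PoolState:
    """Derive both pool sizes from one load signal so the P/D balance is preserved.

    This mirrors the abstract's idea of scaling the prefill and decode pools
    jointly from a single robust metric rather than autoscaling each pool
    independently, which is what lets the two stages stay in balance.
    """
    # Size the decode pool to serve the observed load with headroom.
    decode = max(1, math.ceil(load_metric / (capacity_per_decode * headroom)))
    # Derive the prefill pool from the decode pool via the target ratio,
    # keeping the two stages balanced as traffic shifts.
    prefill = max(1, math.ceil(decode * pd_ratio))
    return PoolState(prefill_replicas=prefill, decode_replicas=decode)


if __name__ == "__main__":
    target = coordinated_scale(load_metric=120_000.0, capacity_per_decode=9_000.0)
    print(target)  # PoolState(prefill_replicas=9, decode_replicas=17)
```

Under these assumptions, only the shared signal and a target ratio determine both pool sizes, so a traffic surge grows prefill and decode together instead of letting one stage starve the other.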