Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference

August 27, 2025
作者: Rongzhi Li, Ruogu Du, Zefang Chu, Sida Zhao, Chunlei Han, Zuocheng Shi, Yiwen Shao, Huanle Han, Long Huang, Zherui Liu, Shufan Liu
cs.AI

Abstract

Serving Large Language Models (LLMs) is a GPU-intensive task where traditional autoscalers fall short, particularly for modern Prefill-Decode (P/D) disaggregated architectures. This architectural shift, while powerful, introduces significant operational challenges, including inefficient use of heterogeneous hardware, network bottlenecks, and critical imbalances between prefill and decode stages. We introduce HeteroScale, a coordinated autoscaling framework that addresses the core challenges of P/D disaggregated serving. HeteroScale combines a topology-aware scheduler that adapts to heterogeneous hardware and network constraints with a novel metric-driven policy derived from the first large-scale empirical study of autoscaling signals in production. By leveraging a single, robust metric to jointly scale prefill and decode pools, HeteroScale maintains architectural balance while ensuring efficient, adaptive resource management. Deployed in a massive production environment on tens of thousands of GPUs, HeteroScale has proven its effectiveness, increasing average GPU utilization by a significant 26.6 percentage points and saving hundreds of thousands of GPU-hours daily, all while upholding stringent service level objectives.
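The abstract describes the scaling policy only at a high level: a single, robust metric jointly drives the prefill and decode pools so the two stages stay balanced. The sketch below illustrates one way such coordinated scaling could look; the metric name (decode throughput per GPU), the target value, and the fixed prefill-to-decode ratio are illustrative assumptions, not details taken from the paper.

```python
# A minimal, hypothetical sketch of metric-driven joint scaling for P/D pools.
# This is NOT the paper's implementation: the signal (`decode_tps_per_gpu`),
# the operator-set target, and the fixed prefill:decode ratio are assumptions
# used only to illustrate coordinated scaling from a single metric.
import math
from dataclasses import dataclass


@dataclass
class PoolState:
    prefill_replicas: int
    decode_replicas: int


def scale_jointly(
    state: PoolState,
    decode_tps_per_gpu: float,   # assumed scaling signal (the single shared metric)
    target_tps_per_gpu: float,   # assumed per-GPU throughput target set by operators
    prefill_per_decode: float,   # assumed fixed P:D replica ratio that keeps balance
    min_replicas: int = 1,
) -> PoolState:
    """Scale the decode pool to track the target metric, then derive the
    prefill pool from a fixed ratio so the two stages scale together."""
    # Deviation of the observed signal from the target drives the decode pool.
    load_factor = decode_tps_per_gpu / target_tps_per_gpu
    new_decode = max(min_replicas, math.ceil(state.decode_replicas * load_factor))
    # Prefill follows decode through the ratio and is never scaled independently.
    new_prefill = max(min_replicas, math.ceil(new_decode * prefill_per_decode))
    return PoolState(prefill_replicas=new_prefill, decode_replicas=new_decode)


if __name__ == "__main__":
    state = PoolState(prefill_replicas=4, decode_replicas=8)
    # Observed throughput is 30% above target, so both pools grow in lockstep.
    print(scale_jointly(state, decode_tps_per_gpu=1.3, target_tps_per_gpu=1.0,
                        prefill_per_decode=0.5))
```

Keeping prefill derived from decode, rather than scaling each pool on its own signal, is one simple way to avoid the prefill/decode imbalance the abstract highlights.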