Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference

Abstract

Serving Large Language Models (LLMs) is a GPU-intensive task for which traditional autoscalers fall short, particularly in modern Prefill-Decode (P/D) disaggregated architectures. This architectural shift, while powerful, introduces significant operational challenges, including inefficient use of heterogeneous hardware, network bottlenecks, and critical imbalances between the prefill and decode stages. We introduce HeteroScale, a coordinated autoscaling framework that addresses the core challenges of P/D disaggregated serving. HeteroScale combines a topology-aware scheduler that adapts to heterogeneous hardware and network constraints with a novel metric-driven policy derived from the first large-scale empirical study of autoscaling signals in production. By using a single, robust metric to jointly scale the prefill and decode pools, HeteroScale maintains architectural balance while ensuring efficient, adaptive resource management. Deployed in a massive production environment on tens of thousands of GPUs, HeteroScale increased average GPU utilization by 26.6 percentage points and saves hundreds of thousands of GPU-hours daily, all while upholding stringent service level objectives.
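As a rough illustration of what jointly scaling both pools from one signal can look like, the sketch below applies a proportional rule (the same shape as the one used by the Kubernetes Horizontal Pod Autoscaler) to the decode pool and derives the prefill pool from a fixed ratio. It is a minimal sketch under stated assumptions and not HeteroScale's actual policy: the abstract does not name the metric, so `observed`, `target`, `pd_ratio`, `tolerance`, and the function and class names are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class PoolSizes:
    prefill: int  # number of prefill instances
    decode: int   # number of decode instances


def coordinated_scale(current: PoolSizes, observed: float, target: float,
                      pd_ratio: float, tolerance: float = 0.1) -> PoolSizes:
    """Resize both pools from one shared signal (illustrative only).

    observed / target: a hypothetical load metric and its set point.
    pd_ratio: assumed desired prefill-to-decode instance ratio.
    tolerance: dead band around the target that suppresses small changes.
    """
    if target <= 0 or pd_ratio <= 0:
        raise ValueError("target and pd_ratio must be positive")

    factor = observed / target
    # Inside the dead band, keep the current sizes to avoid flapping.
    if abs(factor - 1.0) <= tolerance:
        return current

    # Scale the decode pool proportionally, then derive prefill from the
    # fixed ratio so the two pools never drift apart.
    decode = max(1, math.ceil(current.decode * factor))
    prefill = max(1, math.ceil(decode * pd_ratio))
    return PoolSizes(prefill=prefill, decode=decode)


if __name__ == "__main__":
    pools = PoolSizes(prefill=4, decode=8)
    print(coordinated_scale(pools, observed=150.0, target=100.0, pd_ratio=0.5))
```

Deriving the prefill size from the decode size, rather than scaling each pool on its own signal, is one simple way to keep the ratio between stages fixed. A production system would also need to account for hardware type and network topology, which this sketch ignores.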
