NanoFlow: Towards Optimal Large Language Model Serving Throughput

Abstract

The increasing usage of Large Language Models (LLMs) has resulted in a surging demand for planet-scale serving systems, where tens of thousands of GPUs continuously serve hundreds of millions of users. Consequently, throughput (under reasonable latency constraints) has emerged as a key metric that determines serving systems' performance. To boost throughput, various methods of inter-device parallelism (e.g., data, tensor, pipeline) have been explored. However, existing methods do not consider overlapping the utilization of different resources within a single device, leading to underutilization and sub-optimal performance. We propose NanoFlow, a novel serving framework that exploits intra-device parallelism, which overlaps the usage of resources including compute, memory, and network within a single device through operation co-scheduling. To exploit intra-device parallelism, NanoFlow introduces two key innovations. First, NanoFlow splits requests into nano-batches at the granularity of operations, which breaks the dependency between sequential operations in LLM inference and enables overlapping. Second, to benefit from this overlapping, NanoFlow uses an operation-level pipeline with execution unit scheduling, which partitions the device's functional units and simultaneously executes different operations in each unit. NanoFlow automates the pipeline setup using a parameter search algorithm, which makes it easy to port NanoFlow to different models. We implement NanoFlow on NVIDIA GPUs and evaluate end-to-end serving throughput on several popular models, including LLaMA-2-70B, Mixtral 8x7B, and LLaMA-3-8B. With practical workloads, NanoFlow provides a 1.91x throughput boost compared to state-of-the-art serving systems, achieving 59% to 72% of optimal throughput across ported models.
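The intuition behind nano-batching can be illustrated with a toy timing model. The sketch below is not NanoFlow's implementation or its parameter search. It is a minimal simulation under stated assumptions: the operation names and timings in `LAYER_OPS` are hypothetical, each operation is assumed to use exactly one resource, operation time is assumed to scale linearly with batch size, and operations on different resources are assumed to run concurrently without interfering. The helpers `sequential_time` and `overlapped_time` are likewise names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    resource: str           # "compute", "memory", or "network"
    full_batch_time: float  # hypothetical time units for the whole batch


# Hypothetical operations for one transformer layer; the timings are
# illustrative placeholders, not measurements from the paper.
LAYER_OPS = [
    Op("KQV GEMM", "compute", 4.0),
    Op("decode attention", "memory", 3.0),
    Op("all-reduce", "network", 2.0),
    Op("FFN GEMM", "compute", 6.0),
]


def sequential_time(ops):
    """Time when every operation runs on the full batch, one after another."""
    return sum(op.full_batch_time for op in ops)


def overlapped_time(ops, num_nano_batches):
    """Makespan when the batch is split into nano-batches.

    Assumptions of this toy model:
      * an operation on one nano-batch takes full_batch_time / num_nano_batches;
      * each resource runs one operation at a time;
      * within a nano-batch, operations keep their original order;
      * operations on different resources overlap perfectly.
    """
    resource_free = {}                # time at which each resource becomes idle
    done = [0.0] * num_nano_batches   # finish time of the last op per nano-batch
    for op in ops:
        duration = op.full_batch_time / num_nano_batches
        for nb in range(num_nano_batches):
            # Wait for both the resource and this nano-batch's previous op.
            start = max(resource_free.get(op.resource, 0.0), done[nb])
            finish = start + duration
            resource_free[op.resource] = finish
            done[nb] = finish
    return max(done)


if __name__ == "__main__":
    print("sequential:", sequential_time(LAYER_OPS))
    for n in (1, 2, 4):
        print(f"{n} nano-batch(es):", overlapped_time(LAYER_OPS, n))
```

With these made-up timings, the sequential schedule takes 15.0 time units, while splitting into two nano-batches brings the makespan down to 10.5 because the memory-bound and network-bound operations of one nano-batch run while the other nano-batch occupies the compute units. On real hardware, co-scheduled kernels compete for shared resources and small batches can be less efficient, so the gain would differ from what this idealized model suggests.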
