NanoFlow: Towards Optimal Large Language Model Serving Throughput
August 22, 2024
Authors: Kan Zhu, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Yufei Gao, Qinyu Xu, Tian Tang, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, Stephanie Wang, Arvind Krishnamurthy, Baris Kasikci
cs.AI
Abstract
The increasing usage of Large Language Models (LLMs) has resulted in a
surging demand for planet-scale serving systems, where tens of thousands of
GPUs continuously serve hundreds of millions of users. Consequently, throughput
(under reasonable latency constraints) has emerged as a key metric that
determines serving systems' performance. To boost throughput, various methods
of inter-device parallelism (e.g., data, tensor, pipeline) have been explored.
However, existing methods do not consider overlapping the utilization of
different resources within a single device, leading to underutilization and
sub-optimal performance.
We propose NanoFlow, a novel serving framework that exploits intra-device
parallelism, which overlaps the usage of resources including compute, memory,
and network within a single device through operation co-scheduling. To exploit
intra-device parallelism, NanoFlow introduces two key innovations: First,
NanoFlow splits requests into nano-batches at the granularity of operations,
which breaks the dependency of sequential operations in LLM inference and
enables overlapping; then, to benefit from overlapping, NanoFlow uses an
operation-level pipeline with execution unit scheduling, which partitions the
device's functional units and simultaneously executes different operations in
each unit. NanoFlow automates the pipeline setup using a parameter search
algorithm, which enables easily porting NanoFlow to different models. We
implement NanoFlow on NVIDIA GPUs and evaluate end-to-end serving throughput on
several popular models such as LLaMA-2-70B, Mixtral 8x7B, and LLaMA-3-8B.
With practical workloads, NanoFlow provides a 1.91x throughput boost compared to
state-of-the-art serving systems, achieving 59% to 72% of optimal throughput
across ported models.
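
To make the nano-batching and operation co-scheduling idea concrete, the following minimal Python sketch shows how a request batch can be split into nano-batches and software-pipelined so that compute-, memory-, and network-bound operations from different nano-batches overlap within one device. The operation list, resource labels, and round-robin schedule are illustrative assumptions for exposition only, not NanoFlow's actual kernels, scheduler, or execution-unit partitioning.

```python
# Illustrative sketch of nano-batching with an operation-level pipeline.
# Not NanoFlow's implementation: operations, resource labels, and the
# schedule are assumptions made for demonstration purposes.

from dataclasses import dataclass

# Hypothetical per-layer operation sequence and the resource each one stresses.
OPS = [
    ("qkv_gemm", "compute"),    # dense GEMM -> compute-bound
    ("attention", "memory"),    # KV-cache reads -> memory-bandwidth-bound
    ("all_reduce", "network"),  # tensor-parallel sync -> network-bound
    ("ffn_gemm", "compute"),
]

@dataclass
class NanoBatch:
    idx: int
    requests: list

def split_into_nano_batches(batch, num_nano_batches):
    """Split one global batch into smaller nano-batches at operation granularity."""
    chunk = max(1, len(batch) // num_nano_batches)
    return [NanoBatch(i, batch[i * chunk:(i + 1) * chunk])
            for i in range(num_nano_batches)]

def pipeline_schedule(nano_batches):
    """Print a software-pipelined schedule: nano-batch i runs operation (t - i)
    at time step t, so different resources are busy simultaneously."""
    steps = len(OPS) + len(nano_batches) - 1
    for t in range(steps):
        active = []
        for nb in nano_batches:
            op_idx = t - nb.idx
            if 0 <= op_idx < len(OPS):
                name, resource = OPS[op_idx]
                active.append(f"nano-batch {nb.idx}: {name} ({resource})")
        print(f"step {t}: " + " | ".join(active))

if __name__ == "__main__":
    batch = [f"request_{i}" for i in range(8)]
    pipeline_schedule(split_into_nano_batches(batch, num_nano_batches=4))
```

Running the sketch prints steps in which, for example, one nano-batch's attention (memory-bound) overlaps with another's GEMM (compute-bound), which is the kind of intra-device resource overlap the abstract describes; in the real system the overlap comes from co-scheduling operations on partitioned execution units rather than from a fixed round-robin schedule.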