NanoFlow: Towards Optimal Large Language Model Serving Throughput

August 22, 2024
Authors: Kan Zhu, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Yufei Gao, Qinyu Xu, Tian Tang, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, Stephanie Wang, Arvind Krishnamurthy, Baris Kasikci
cs.AI

Abstract

The increasing usage of Large Language Models (LLMs) has resulted in a surging demand for planet-scale serving systems, where tens of thousands of GPUs continuously serve hundreds of millions of users. Consequently, throughput (under reasonable latency constraints) has emerged as a key metric that determines serving systems' performance. To boost throughput, various methods of inter-device parallelism (e.g., data, tensor, pipeline) have been explored. However, existing methods do not consider overlapping the utilization of different resources within a single device, leading to underutilization and sub-optimal performance. We propose NanoFlow, a novel serving framework that exploits intra-device parallelism, which overlaps the usage of resources including compute, memory, and network within a single device through operation co-scheduling. To exploit intra-device parallelism, NanoFlow introduces two key innovations: First, NanoFlow splits requests into nano-batches at the granularity of operations, which breaks the dependency of sequential operations in LLM inference and enables overlapping; then, to benefit from overlapping, NanoFlow uses an operation-level pipeline with execution unit scheduling, which partitions the device's functional units and simultaneously executes different operations in each unit. NanoFlow automates the pipeline setup using a parameter search algorithm, which makes it easy to port NanoFlow to different models. We implement NanoFlow on NVIDIA GPUs and evaluate end-to-end serving throughput on several popular models such as LLaMA-2-70B, Mixtral 8x7B, and LLaMA-3-8B. With practical workloads, NanoFlow provides a 1.91x throughput boost compared to state-of-the-art serving systems, achieving 59% to 72% of optimal throughput across ported models.
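
The nano-batching and operation-level pipelining described in the abstract can be illustrated with a short sketch. The snippet below is a minimal, hypothetical Python illustration and not NanoFlow's actual implementation: the operation names, the number of nano-batches, and the simple shifted-pipeline schedule are assumptions chosen only to show how splitting a batch at operation granularity lets compute-, memory-, and network-bound operations from different nano-batches run in the same step.

```python
# Minimal, hypothetical sketch of nano-batching and an operation-level
# pipeline. Operation names, nano-batch counts, and the shifted schedule
# are illustrative assumptions, not NanoFlow's actual kernels or scheduler.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class NanoBatch:
    request_ids: List[int]  # requests assigned to this nano-batch


def split_into_nano_batches(request_ids: List[int], n: int) -> List[NanoBatch]:
    """Split one global batch into n operation-granularity nano-batches."""
    chunk = (len(request_ids) + n - 1) // n
    return [NanoBatch(request_ids[i:i + chunk])
            for i in range(0, len(request_ids), chunk)]


def schedule_iteration(nano_batches: List[NanoBatch],
                       ops: List[str]) -> List[List[Tuple[str, List[int]]]]:
    """Build a toy pipeline: in step t, nano-batch k runs operation t - k.
    Different nano-batches therefore execute different operations at the
    same time, so compute-, memory-, and network-bound work can overlap
    on one device, each assigned to its own partition of execution units."""
    steps = []
    for t in range(len(ops) + len(nano_batches) - 1):
        concurrent = []
        for k, nb in enumerate(nano_batches):
            j = t - k
            if 0 <= j < len(ops):
                concurrent.append((ops[j], nb.request_ids))
        steps.append(concurrent)  # operations issued together in this step
    return steps


if __name__ == "__main__":
    nano_batches = split_into_nano_batches(list(range(8)), n=2)
    ops = ["qkv_gemm", "attention", "o_proj_gemm", "ffn_gemm", "allreduce"]
    for step_ops in schedule_iteration(nano_batches, ops):
        print(step_ops)
```

In the real system, the mapping of operations to the device's execution units and the pipeline configuration are not fixed by hand as above; per the abstract, NanoFlow sets up the pipeline automatically with a parameter search algorithm, which is what makes porting to different models straightforward.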

