NanoFlow: Towards Optimal Large Language Model Serving Throughput

August 22, 2024
Authors: Kan Zhu, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Yufei Gao, Qinyu Xu, Tian Tang, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, Stephanie Wang, Arvind Krishnamurthy, Baris Kasikci
cs.AI

Abstract

The increasing usage of Large Language Models (LLMs) has resulted in a surging demand for planet-scale serving systems, where tens of thousands of GPUs continuously serve hundreds of millions of users. Consequently, throughput (under reasonable latency constraints) has emerged as a key metric that determines serving systems' performance. To boost throughput, various methods of inter-device parallelism (e.g., data, tensor, pipeline) have been explored. However, existing methods do not consider overlapping the utilization of different resources within a single device, leading to underutilization and sub-optimal performance. We propose NanoFlow, a novel serving framework that exploits intra-device parallelism, which overlaps the usage of resources including compute, memory, and network within a single device through operation co-scheduling. To exploit intra-device parallelism, NanoFlow introduces two key innovations: First, NanoFlow splits requests into nano-batches at the granularity of operations, which breaks the dependency of sequential operations in LLM inference and enables overlapping; then, to benefit from overlapping, NanoFlow uses an operation-level pipeline with execution unit scheduling, which partitions the device's functional units and simultaneously executes different operations in each unit. NanoFlow automates the pipeline setup using a parameter search algorithm, which makes it easy to port NanoFlow to different models. We implement NanoFlow on NVIDIA GPUs and evaluate end-to-end serving throughput on several popular models such as LLaMA-2-70B, Mixtral 8x7B, and LLaMA-3-8B. With practical workloads, NanoFlow provides a 1.91x throughput boost compared to state-of-the-art serving systems, achieving 59% to 72% of optimal throughput across ported models.
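
The nano-batching and operation-level pipelining described in the abstract can be illustrated with a short sketch. The snippet below is a minimal, hypothetical Python illustration and not NanoFlow's actual implementation: the operation names, the number of nano-batches, and the simple shifted-pipeline schedule are assumptions chosen only to show how splitting a batch at operation granularity lets compute-, memory-, and network-bound operations from different nano-batches run in the same step.

```python
# Minimal, hypothetical sketch of nano-batching and an operation-level
# pipeline. Operation names, nano-batch counts, and the shifted schedule
# are illustrative assumptions, not NanoFlow's actual kernels or scheduler.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class NanoBatch:
    request_ids: List[int]  # requests assigned to this nano-batch


def split_into_nano_batches(request_ids: List[int], n: int) -> List[NanoBatch]:
    """Split one global batch into n operation-granularity nano-batches."""
    chunk = (len(request_ids) + n - 1) // n
    return [NanoBatch(request_ids[i:i + chunk])
            for i in range(0, len(request_ids), chunk)]


def schedule_iteration(nano_batches: List[NanoBatch],
                       ops: List[str]) -> List[List[Tuple[str, List[int]]]]:
    """Build a toy pipeline: in step t, nano-batch k runs operation t - k.
    Different nano-batches therefore execute different operations at the
    same time, so compute-, memory-, and network-bound work can overlap
    on one device, each assigned to its own partition of execution units."""
    steps = []
    for t in range(len(ops) + len(nano_batches) - 1):
        concurrent = []
        for k, nb in enumerate(nano_batches):
            j = t - k
            if 0 <= j < len(ops):
                concurrent.append((ops[j], nb.request_ids))
        steps.append(concurrent)  # operations issued together in this step
    return steps


if __name__ == "__main__":
    nano_batches = split_into_nano_batches(list(range(8)), n=2)
    ops = ["qkv_gemm", "attention", "o_proj_gemm", "ffn_gemm", "allreduce"]
    for step_ops in schedule_iteration(nano_batches, ops):
        print(step_ops)
```

In the real system, the mapping of operations to the device's execution units and the pipeline configuration are not fixed by hand as above; per the abstract, NanoFlow sets up the pipeline automatically with a parameter search algorithm, which is what makes porting to different models straightforward.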

