NanoFlow: Towards Optimal Large Language Model Serving Throughput
August 22, 2024
Authors: Kan Zhu, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Yufei Gao, Qinyu Xu, Tian Tang, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, Stephanie Wang, Arvind Krishnamurthy, Baris Kasikci
cs.AI
Abstract
The increasing usage of Large Language Models (LLMs) has resulted in a
surging demand for planet-scale serving systems, where tens of thousands of
GPUs continuously serve hundreds of millions of users. Consequently, throughput
(under reasonable latency constraints) has emerged as a key metric that
determines serving systems' performance. To boost throughput, various methods
of inter-device parallelism (e.g., data, tensor, pipeline) have been explored.
However, existing methods do not consider overlapping the utilization of
different resources within a single device, leading to underutilization and
sub-optimal performance.
We propose NanoFlow, a novel serving framework that exploits intra-device
parallelism, which overlaps the usage of resources including compute, memory,
and network within a single device through operation co-scheduling. To exploit
intra-device parallelism, NanoFlow introduces two key innovations: First,
NanoFlow splits requests into nano-batches at the granularity of operations,
which breaks the dependency of sequential operations in LLM inference and
enables overlapping; then, to benefit from overlapping, NanoFlow uses an
operation-level pipeline with execution unit scheduling, which partitions the
device's functional units and simultaneously executes different operations in
each unit. NanoFlow automates the pipeline setup using a parameter search
algorithm, which enables easily porting NanoFlow to different models. We
implement NanoFlow on NVIDIA GPUs and evaluate end-to-end serving throughput on
several popular models such as LLaMA-2-70B, Mixtral 8x7B, and LLaMA-3-8B.
With practical workloads, NanoFlow provides a 1.91x throughput boost compared to
state-of-the-art serving systems, achieving 59% to 72% of optimal throughput
across ported models.
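
To make the nano-batching and operation co-scheduling idea concrete, the following minimal Python sketch shows how a request batch can be split into nano-batches and software-pipelined so that compute-, memory-, and network-bound operations from different nano-batches overlap within one device. The operation list, resource labels, and round-robin schedule are illustrative assumptions for exposition only, not NanoFlow's actual kernels, scheduler, or execution-unit partitioning.

```python
# Illustrative sketch of nano-batching with an operation-level pipeline.
# Not NanoFlow's implementation: operations, resource labels, and the
# schedule are assumptions made for demonstration purposes.

from dataclasses import dataclass

# Hypothetical per-layer operation sequence and the resource each one stresses.
OPS = [
    ("qkv_gemm", "compute"),    # dense GEMM -> compute-bound
    ("attention", "memory"),    # KV-cache reads -> memory-bandwidth-bound
    ("all_reduce", "network"),  # tensor-parallel sync -> network-bound
    ("ffn_gemm", "compute"),
]

@dataclass
class NanoBatch:
    idx: int
    requests: list

def split_into_nano_batches(batch, num_nano_batches):
    """Split one global batch into smaller nano-batches at operation granularity."""
    chunk = max(1, len(batch) // num_nano_batches)
    return [NanoBatch(i, batch[i * chunk:(i + 1) * chunk])
            for i in range(num_nano_batches)]

def pipeline_schedule(nano_batches):
    """Print a software-pipelined schedule: nano-batch i runs operation (t - i)
    at time step t, so different resources are busy simultaneously."""
    steps = len(OPS) + len(nano_batches) - 1
    for t in range(steps):
        active = []
        for nb in nano_batches:
            op_idx = t - nb.idx
            if 0 <= op_idx < len(OPS):
                name, resource = OPS[op_idx]
                active.append(f"nano-batch {nb.idx}: {name} ({resource})")
        print(f"step {t}: " + " | ".join(active))

if __name__ == "__main__":
    batch = [f"request_{i}" for i in range(8)]
    pipeline_schedule(split_into_nano_batches(batch, num_nano_batches=4))
```

Running the sketch prints steps in which, for example, one nano-batch's attention (memory-bound) overlaps with another's GEMM (compute-bound), which is the kind of intra-device resource overlap the abstract describes; in the real system the overlap comes from co-scheduling operations on partitioned execution units rather than from a fixed round-robin schedule.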