NanoFlow: Towards Optimal Large Language Model Serving Throughput

Abstract

The increasing usage of Large Language Models (LLMs) has resulted in a surging demand for planet-scale serving systems, where tens of thousands of GPUs continuously serve hundreds of millions of users. Consequently, throughput (under reasonable latency constraints) has emerged as a key metric that determines serving systems' performance. To boost throughput, various methods of inter-device parallelism (e.g., data, tensor, pipeline) have been explored. However, existing methods do not consider overlapping the utilization of different resources within a single device, leading to underutilization and sub-optimal performance. We propose NanoFlow, a novel serving framework that exploits intra-device parallelism: it overlaps the usage of resources, including compute, memory, and network, within a single device through operation co-scheduling. To exploit intra-device parallelism, NanoFlow introduces two key innovations. First, NanoFlow splits requests into nano-batches at the granularity of operations, which breaks the dependency of sequential operations in LLM inference and enables overlapping. Second, to benefit from overlapping, NanoFlow uses an operation-level pipeline with execution unit scheduling, which partitions the device's functional units and simultaneously executes different operations in each unit. NanoFlow automates the pipeline setup using a parameter search algorithm, which makes it easy to port NanoFlow to different models. We implement NanoFlow on NVIDIA GPUs and evaluate end-to-end serving throughput on several popular models, including LLaMA-2-70B, Mixtral 8x7B, and LLaMA-3-8B. With practical workloads, NanoFlow provides a 1.91x throughput boost compared to state-of-the-art serving systems, achieving 59% to 72% of optimal throughput across ported models.
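
To make the nano-batching idea concrete, the following is a minimal toy sketch, not NanoFlow's implementation. It assumes each operation in a layer is bound by a different resource (compute, memory, or network), that splitting a batch into n nano-batches divides each operation's cost by exactly n, and that operations on different resources do not interfere with one another. The function names `split_into_nano_batches` and `pipelined_makespan` and the unit costs are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def split_into_nano_batches(requests: Sequence[int], num_nano_batches: int) -> List[List[int]]:
    """Split a batch of request ids into contiguous, nearly equal nano-batches."""
    if num_nano_batches <= 0:
        raise ValueError("num_nano_batches must be positive")
    base, extra = divmod(len(requests), num_nano_batches)
    nano_batches, start = [], 0
    for i in range(num_nano_batches):
        size = base + (1 if i < extra else 0)  # first `extra` nano-batches get one more
        nano_batches.append(list(requests[start:start + size]))
        start += size
    return nano_batches


def pipelined_makespan(op_costs: Sequence[float], num_nano_batches: int) -> float:
    """Idealized time to run a chain of ops over all nano-batches.

    Assumption: op k runs on its own resource, so op k of one nano-batch can
    overlap with op k+1 of the previous nano-batch. Within a nano-batch the
    ops stay sequential, and each resource handles one nano-batch at a time.
    """
    n = num_nano_batches
    resource_free_at = [0.0] * len(op_costs)
    for _ in range(n):
        prev_op_done = 0.0
        for k, cost in enumerate(op_costs):
            start = max(prev_op_done, resource_free_at[k])
            prev_op_done = start + cost / n  # idealized: cost scales with batch size
            resource_free_at[k] = prev_op_done
    return resource_free_at[-1]


if __name__ == "__main__":
    print(split_into_nano_batches(list(range(10)), 4))
    # Hypothetical unit costs for a compute-, memory-, and network-bound op.
    costs = [1.0, 1.0, 1.0]
    for n in (1, 2, 4):
        print(n, pipelined_makespan(costs, n))  # 3.0, 2.0, 1.5
```

With a single batch the three operations run back to back (3.0 time units), while two and four nano-batches finish in 2.0 and 1.5 units in this idealized model. On a real GPU, smaller batches are less efficient and co-scheduled kernels contend for shared hardware, which is presumably why the abstract pairs nano-batching with execution unit scheduling and an automated parameter search.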
