FlowPrefill: Decoupling Preemption from Prefill Scheduling Granularity to Mitigate Head-of-Line Blocking in LLM Serving

Abstract

The growing demand for large language models (LLMs) requires serving systems to handle many concurrent requests with diverse service level objectives (SLOs). This exacerbates head-of-line (HoL) blocking during the compute-intensive prefill phase, where long-running requests monopolize resources and delay higher-priority ones, leading to widespread time-to-first-token (TTFT) SLO violations. Chunked prefill makes execution interruptible, but it introduces an inherent trade-off between responsiveness and throughput: reducing the chunk size improves response latency but degrades computational efficiency, whereas increasing the chunk size maximizes throughput but exacerbates blocking. This necessitates an adaptive preemption mechanism. However, dynamically balancing execution granularity against scheduling overhead remains a key challenge. In this paper, we propose FlowPrefill, a TTFT-goodput-optimized serving system that resolves this conflict by decoupling preemption granularity from scheduling frequency. To achieve adaptive prefill scheduling, FlowPrefill introduces two key innovations: 1) Operator-Level Preemption, which leverages operator boundaries to enable fine-grained execution interruption without the efficiency loss associated with fixed small chunks; and 2) Event-Driven Scheduling, which triggers scheduling decisions only upon request arrival or completion events, thereby supporting efficient preemption responsiveness while minimizing control-plane overhead. Evaluation on real-world production traces shows that FlowPrefill improves maximum goodput by up to 5.6× compared to state-of-the-art systems while satisfying heterogeneous SLOs.
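The two ideas in the abstract can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the FlowPrefill implementation: every operator is assumed to take one time unit, the scheduling policy is assumed to be earliest-deadline-first, and the names `Request`, `make_workload`, and `simulate` are hypothetical. The running prefill can only be interrupted at an operator boundary, and the scheduler is invoked only when a request arrives or completes.

```python
import heapq
from dataclasses import dataclass, field
from typing import Optional


@dataclass(order=True)
class Request:
    deadline: float                                   # absolute TTFT deadline; only field used for ordering
    name: str = field(compare=False)
    arrival: float = field(compare=False)
    ops_left: int = field(compare=False)              # operators remaining in this prefill
    finish: Optional[float] = field(default=None, compare=False)


def make_workload():
    # Hypothetical workload: a long prefill with a loose deadline and a
    # short prefill with a tight deadline that arrives while the long one runs.
    return [
        Request(deadline=100.0, name="long", arrival=0.0, ops_left=10),
        Request(deadline=6.0, name="short", arrival=2.0, ops_left=2),
    ]


def simulate(requests, preemptive, op_time=1.0):
    """Run prefills one operator at a time; return finish times and scheduler calls."""
    pending = sorted(requests, key=lambda r: r.arrival)
    ready = []                                        # min-heap ordered by deadline
    now, running, sched_calls, i = 0.0, None, 0, 0
    while i < len(pending) or ready or running is not None:
        # We are at an operator boundary: admit everything that has arrived.
        arrived = False
        while i < len(pending) and pending[i].arrival <= now:
            heapq.heappush(ready, pending[i])
            i += 1
            arrived = True
        completed = running is not None and running.ops_left == 0
        if completed:
            running.finish = now
            running = None
        # Event-driven: the scheduler runs only on arrival or completion.
        if arrived or completed:
            sched_calls += 1
            if preemptive and running is not None:
                heapq.heappush(ready, running)        # running request competes again
                running = None
            if running is None and ready:
                running = heapq.heappop(ready)        # earliest deadline first
        if running is None:
            if i < len(pending):
                now = pending[i].arrival              # idle: jump to the next arrival
            continue
        running.ops_left -= 1                         # execute exactly one operator
        now += op_time
    return {r.name: r.finish for r in requests}, sched_calls


if __name__ == "__main__":
    print(simulate(make_workload(), preemptive=True))
    print(simulate(make_workload(), preemptive=False))
```

In this toy workload, the preemptive run finishes the short request at time 4, inside its deadline of 6, while the non-preemptive run finishes it at time 12 because it waits behind the long prefill. Both runs invoke the scheduler four times even though twelve operator boundaries are crossed, which is the sense in which preemption granularity is decoupled from scheduling frequency.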

