

FlowPrefill: Decoupling Preemption from Prefill Scheduling Granularity to Mitigate Head-of-Line Blocking in LLM Serving

February 18, 2026
Authors: Chia-chi Hsieh, Zan Zong, Xinyang Chen, Jianjiang Li, Jidong Zhai, Lijie Wen
cs.AI

Abstract

The growing demand for large language models (LLMs) requires serving systems to handle many concurrent requests with diverse service level objectives (SLOs). This exacerbates head-of-line (HoL) blocking during the compute-intensive prefill phase, where long-running requests monopolize resources and delay higher-priority ones, leading to widespread time-to-first-token (TTFT) SLO violations. While chunked prefill enables interruptibility, it introduces an inherent trade-off between responsiveness and throughput: reducing chunk size improves response latency but degrades computational efficiency, whereas increasing chunk size maximizes throughput but exacerbates blocking. This necessitates an adaptive preemption mechanism; however, dynamically balancing execution granularity against scheduling overhead remains a key challenge. In this paper, we propose FlowPrefill, a TTFT-goodput-optimized serving system that resolves this conflict by decoupling preemption granularity from scheduling frequency. To achieve adaptive prefill scheduling, FlowPrefill introduces two key innovations: 1) Operator-Level Preemption, which leverages operator boundaries to enable fine-grained execution interruption without the efficiency loss associated with fixed small chunking; and 2) Event-Driven Scheduling, which triggers scheduling decisions only upon request arrival or completion events, thereby supporting efficient preemption responsiveness while minimizing control-plane overhead. Evaluation on real-world production traces shows that FlowPrefill improves maximum goodput by up to 5.6× compared to state-of-the-art systems while satisfying heterogeneous SLOs.
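The interplay of the two mechanisms can be illustrated with a toy discrete-step sketch: prefill work is modeled as a sequence of operators, preemption is honored only at operator boundaries, and scheduling decisions fire only on arrival and completion events. All names, the priority policy, and the step-based execution model here are illustrative assumptions, not the paper's actual implementation.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    priority: int                         # lower value = tighter TTFT SLO
    req_id: str = field(compare=False)
    ops_left: int = field(compare=False)  # remaining prefill operators

class OperatorLevelScheduler:
    """Toy sketch: preemption at operator boundaries, scheduling
    decisions only on request arrival or completion events."""

    def __init__(self):
        self.waiting = []    # min-heap ordered by priority
        self.running = None
        self.finished = []

    def on_arrival(self, req):
        """Event-driven: make a scheduling decision only when a request arrives."""
        heapq.heappush(self.waiting, req)
        if self.running is None:
            self.running = heapq.heappop(self.waiting)
        elif req.priority < self.running.priority:
            # Operator-level preemption: the running request yields at the
            # next operator boundary; its remaining work is requeued.
            heapq.heappush(self.waiting, self.running)
            self.running = heapq.heappop(self.waiting)

    def step(self):
        """Execute one operator of the running request (one boundary)."""
        if self.running is None:
            return
        self.running.ops_left -= 1
        if self.running.ops_left == 0:
            self.finished.append(self.running.req_id)
            # Completion event: trigger the next scheduling decision.
            self.running = heapq.heappop(self.waiting) if self.waiting else None

# A long, low-priority prefill is preempted by an urgent arrival:
sched = OperatorLevelScheduler()
sched.on_arrival(Request(priority=2, req_id="long", ops_left=6))
sched.step(); sched.step()  # two operators of "long" execute
sched.on_arrival(Request(priority=1, req_id="urgent", ops_left=2))
sched.step(); sched.step()  # "urgent" runs to completion first
```

Because the scheduler only acts at operator boundaries and on arrival/completion events, the urgent request finishes before the long one resumes, without the fixed small chunks that chunked prefill would require for the same responsiveness.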
PDF · March 28, 2026