FlowPrefill: Decoupling Preemption from Prefill Scheduling Granularity to Mitigate Head-of-Line Blocking in LLM Serving

February 18, 2026
Authors: Chia-chi Hsieh, Zan Zong, Xinyang Chen, Jianjiang Li, Jidong Zhai, Lijie Wen
cs.AI

Abstract

The growing demand for large language models (LLMs) requires serving systems to handle many concurrent requests with diverse service level objectives (SLOs). This exacerbates head-of-line (HoL) blocking during the compute-intensive prefill phase, where long-running requests monopolize resources and delay higher-priority ones, leading to widespread time-to-first-token (TTFT) SLO violations. While chunked prefill enables interruptibility, it introduces an inherent trade-off between responsiveness and throughput: reducing chunk size improves response latency but degrades computational efficiency, whereas increasing chunk size maximizes throughput but exacerbates blocking. This necessitates an adaptive preemption mechanism. However, dynamically balancing execution granularity against scheduling overheads remains a key challenge. In this paper, we propose FlowPrefill, a TTFT-goodput-optimized serving system that resolves this conflict by decoupling preemption granularity from scheduling frequency. To achieve adaptive prefill scheduling, FlowPrefill introduces two key innovations: 1) Operator-Level Preemption, which leverages operator boundaries to enable fine-grained execution interruption without the efficiency loss associated with fixed small chunking; and 2) Event-Driven Scheduling, which triggers scheduling decisions only upon request arrival or completion events, thereby supporting efficient preemption responsiveness while minimizing control-plane overhead. Evaluation on real-world production traces shows that FlowPrefill improves maximum goodput by up to 5.6× compared to state-of-the-art systems while satisfying heterogeneous SLOs.
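
To make the two ideas in the abstract concrete, the following is a minimal toy simulation, not the FlowPrefill implementation. It assumes that each prefill is a sequence of equal-cost operators, that a request may be interrupted only at an operator boundary, and that the scheduler runs only when a request arrives or completes. The names `Request`, `simulate`, and `op_time`, and the earliest-deadline-first policy, are all hypothetical choices made for illustration; the abstract does not specify the scheduling policy.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Request:
    deadline: float                        # TTFT deadline; earlier means more urgent
    name: str = field(compare=False)
    arrival: float = field(compare=False)
    ops_left: int = field(compare=False)   # operators remaining in this prefill


def simulate(requests, op_time=1.0):
    """Toy model: preempt at operator boundaries, schedule only on events.

    Returns ({name: finish_time}, number_of_scheduler_invocations).
    Note: mutates `ops_left` on the Request objects passed in.
    """
    future = sorted(requests, key=lambda r: r.arrival)  # not yet arrived
    ready = []        # min-heap ordered by deadline (assumed policy: EDF)
    running = None
    now = 0.0
    finish = {}
    sched_calls = 0

    while future or ready or running:
        event = False
        # Arrival events: admit everything that has arrived by this boundary.
        while future and future[0].arrival <= now:
            heapq.heappush(ready, future.pop(0))
            event = True
        # Completion event: the running prefill has no operators left.
        if running is not None and running.ops_left == 0:
            finish[running.name] = now
            running = None
            event = True
        # The scheduler is invoked only when an event occurred.
        if event:
            sched_calls += 1
            if running is not None:
                heapq.heappush(ready, running)  # running request competes again
            running = heapq.heappop(ready) if ready else None
        if running is None:
            if future:
                now = future[0].arrival         # idle: jump to the next arrival
            continue
        # Execute exactly one operator, then return to the boundary check.
        running.ops_left -= 1
        now += op_time
    return finish, sched_calls


# A long, relaxed request A and a short, urgent request B arriving mid-prefill.
finish, calls = simulate([
    Request(deadline=100.0, name="A", arrival=0.0, ops_left=10),
    Request(deadline=6.0, name="B", arrival=3.0, ops_left=2),
])
print(finish, calls)  # {'B': 5.0, 'A': 12.0} 4
```

In this toy run, B preempts A at the operator boundary at time 3 and finishes at time 5, within its deadline of 6, whereas running A to completion first would delay B until time 12. The scheduler is invoked 4 times across 12 operator boundaries, which illustrates, under these simplified assumptions, how preemption granularity (per operator) can be finer than scheduling frequency (per event).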