Duration-Aware Scheduling for ASR Serving Under Workload Drift

Abstract

Scheduling policies in large-scale Automatic Speech Recognition (ASR) serving pipelines play a key role in determining end-to-end (E2E) latency. Yet widely used serving engines rely on first-come-first-served (FCFS) scheduling, which ignores variability in request duration and leads to head-of-line blocking under workload drift. We show that audio duration is an accurate proxy for job processing time in ASR models such as Whisper, and we use this insight to enable duration-aware scheduling. We integrate two classical algorithms, Shortest Job First (SJF) and Highest Response Ratio Next (HRRN), into vLLM and evaluate them under realistic and drifted workloads. On LibriSpeech test-clean, compared with the baseline, SJF reduces median E2E latency by up to 73% at high load but increases 90th-percentile tail latency by up to 97% because long requests are starved. HRRN addresses this trade-off: it reduces median E2E latency by up to 28% while bounding tail-latency degradation to at most 24%. These gains persist under workload drift, with no throughput penalty and less than 0.1 ms of scheduling overhead per request.
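To make the difference between the three policies concrete, the following is a minimal sketch, assuming audio duration is used directly as the service-time estimate. It is not the vLLM integration described above: the `Request` class, `response_ratio`, and `pick_next` are hypothetical names introduced only for illustration, and the response ratio is the textbook HRRN definition, (waiting time + service estimate) / service estimate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """A queued ASR request (hypothetical structure, for illustration only)."""
    name: str
    arrival_s: float         # time the request entered the queue
    audio_duration_s: float  # audio length, used as the processing-time proxy


def response_ratio(req: Request, now_s: float) -> float:
    """Textbook HRRN ratio: (waiting time + service estimate) / service estimate."""
    wait_s = now_s - req.arrival_s
    return (wait_s + req.audio_duration_s) / req.audio_duration_s


def pick_next(queue: list[Request], now_s: float, policy: str) -> Request:
    """Choose the next request to serve under the given policy."""
    if policy == "fcfs":
        # Earliest arrival first, regardless of audio length.
        return min(queue, key=lambda r: r.arrival_s)
    if policy == "sjf":
        # Shortest audio first; long requests can wait indefinitely.
        return min(queue, key=lambda r: r.audio_duration_s)
    if policy == "hrrn":
        # Highest response ratio first; waiting raises a request's priority.
        return max(queue, key=lambda r: response_ratio(r, now_s))
    raise ValueError(f"unknown policy: {policy}")


queue = [
    Request("long", arrival_s=0.0, audio_duration_s=30.0),
    Request("medium", arrival_s=20.0, audio_duration_s=10.0),
    Request("short", arrival_s=58.0, audio_duration_s=2.0),
]
now_s = 60.0
for policy in ("fcfs", "sjf", "hrrn"):
    print(policy, "->", pick_next(queue, now_s, policy).name)
```

With this toy queue at t = 60 s, FCFS selects the 30 s request, SJF selects the 2 s request, and HRRN selects the 10 s request, whose response ratio (5.0) exceeds that of the long (3.0) and short (2.0) requests. Because the ratio grows with waiting time, a long request eventually outranks newly arrived short ones, which is how HRRN limits the starvation that SJF can cause.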