Duration-Aware Scheduling for ASR Serving under Workload Drift

Abstract

Scheduling policies in large-scale Automatic Speech Recognition (ASR) serving pipelines play a key role in determining end-to-end (E2E) latency. Yet widely used serving engines rely on first-come-first-served (FCFS) scheduling, which ignores variability in request duration and causes head-of-line blocking under workload drift. We show that audio duration is an accurate proxy for job processing time in ASR models such as Whisper, and we use this insight to enable duration-aware scheduling. We integrate two classical algorithms, Shortest Job First (SJF) and Highest Response Ratio Next (HRRN), into vLLM and evaluate them under realistic and drifted workloads. On LibriSpeech test-clean, compared to the baseline, SJF reduces median E2E latency by up to 73% at high load but increases 90th-percentile tail latency by up to 97% because long requests are starved. HRRN addresses this trade-off: it reduces median E2E latency by up to 28% while bounding tail-latency degradation to at most 24%. These gains persist under workload drift, with no throughput penalty and less than 0.1 ms of scheduling overhead per request.
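
To make the difference between the three policies concrete, the following is a minimal sketch of how each one might pick the next request from a waiting queue, using audio duration as the service-time proxy. It is not the vLLM integration described above: the `Request` class, its field names, and the `*_pick` functions are hypothetical and exist only for illustration. The HRRN score is the textbook response ratio, (waiting time + service time) / service time.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical request record; field names are assumptions for this sketch.
    req_id: str
    arrival_s: float  # arrival timestamp in seconds
    audio_s: float    # audio duration in seconds, used as the service-time proxy


def fcfs_pick(queue, now):
    """First-come-first-served: the earliest arrival goes first."""
    return min(queue, key=lambda r: r.arrival_s)


def sjf_pick(queue, now):
    """Shortest Job First: the shortest audio goes first (ties broken by arrival)."""
    return min(queue, key=lambda r: (r.audio_s, r.arrival_s))


def hrrn_pick(queue, now):
    """Highest Response Ratio Next: maximize (wait + service) / service."""
    def response_ratio(r):
        service = max(r.audio_s, 1e-6)  # guard against zero-length audio
        wait = now - r.arrival_s
        return (wait + service) / service
    return max(queue, key=response_ratio)


# A 30 s request that arrived first and a 2 s request that arrived later.
long_req = Request("long", arrival_s=0.0, audio_s=30.0)

# Shortly after the short request arrives, SJF and HRRN both prefer it.
early = [long_req, Request("short", arrival_s=9.0, audio_s=2.0)]
print(fcfs_pick(early, now=10.0).req_id)
print(sjf_pick(early, now=10.0).req_id)
print(hrrn_pick(early, now=10.0).req_id)

# Once the long request has waited 100 s, HRRN promotes it; SJF still does not.
late = [long_req, Request("short", arrival_s=99.0, audio_s=2.0)]
print(sjf_pick(late, now=100.0).req_id)
print(hrrn_pick(late, now=100.0).req_id)
```

In the second scenario the long request's response ratio has grown to (100 + 30) / 30 ≈ 4.3, against 1.5 for the newly arrived short request, so HRRN serves it next. This aging effect is the mechanism by which HRRN limits the starvation of long requests that SJF permits.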