Duration-Aware Scheduling for ASR Serving Under Workload Drift

Abstract

Scheduling policies in large-scale Automatic Speech Recognition (ASR) serving pipelines play a key role in determining end-to-end (E2E) latency. Yet widely used serving engines rely on first-come-first-served (FCFS) scheduling, which ignores variability in request duration and causes head-of-line blocking under workload drift. We show that audio duration is an accurate proxy for job processing time in ASR models such as Whisper, and we use this insight to enable duration-aware scheduling. We integrate two classical algorithms, Shortest Job First (SJF) and Highest Response Ratio Next (HRRN), into vLLM and evaluate them under realistic and drifted workloads. On LibriSpeech test-clean, compared to the baseline, SJF reduces median E2E latency by up to 73% at high load, but increases 90th-percentile tail latency by up to 97% because long requests are starved. HRRN addresses this trade-off: it reduces median E2E latency by up to 28% while bounding tail-latency degradation to at most 24%. These gains persist under workload drift, with no throughput penalty and a scheduling overhead of less than 0.1 ms per request.
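The difference between the three policies can be illustrated with a minimal sketch. It is not the vLLM integration described above: the `Request` class, its field names, and the `pick_*` functions are hypothetical, and the sketch assumes that service time is proportional to audio duration, so the duration itself stands in for the job size. HRRN uses the textbook response ratio, (waiting time + service time) / service time.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Request:
    """Hypothetical queued ASR request (names are illustrative only)."""
    req_id: str
    arrival_s: float  # arrival time in seconds
    audio_s: float    # audio duration in seconds; assumed > 0, used as job-size proxy


def pick_fcfs(queue: List[Request], now_s: float) -> Request:
    # First-come-first-served: the earliest arrival goes next.
    return min(queue, key=lambda r: r.arrival_s)


def pick_sjf(queue: List[Request], now_s: float) -> Request:
    # Shortest Job First: the shortest audio goes next, regardless of waiting time.
    return min(queue, key=lambda r: r.audio_s)


def response_ratio(r: Request, now_s: float) -> float:
    # Textbook HRRN ratio: (waiting time + service time) / service time.
    wait_s = max(0.0, now_s - r.arrival_s)
    return (wait_s + r.audio_s) / r.audio_s


def pick_hrrn(queue: List[Request], now_s: float) -> Request:
    # Highest Response Ratio Next: short jobs are favored, but the ratio of a
    # long job grows while it waits, so it is eventually selected.
    return max(queue, key=lambda r: response_ratio(r, now_s))


# A long request that arrived first and a short one that arrived later.
long_req = Request("long", arrival_s=0.0, audio_s=30.0)
short_early = Request("short", arrival_s=9.0, audio_s=2.0)
short_late = Request("short", arrival_s=99.0, audio_s=2.0)

# At t = 10 s the long request has waited only briefly.
early_choices = {
    "fcfs": pick_fcfs([long_req, short_early], 10.0).req_id,
    "sjf": pick_sjf([long_req, short_early], 10.0).req_id,
    "hrrn": pick_hrrn([long_req, short_early], 10.0).req_id,
}

# At t = 100 s the long request has waited 100 s and a new short one just arrived.
late_choices = {
    "sjf": pick_sjf([long_req, short_late], 100.0).req_id,
    "hrrn": pick_hrrn([long_req, short_late], 100.0).req_id,
}
```

In this toy queue, SJF always prefers the short request, which is the starvation mechanism behind its tail-latency increase, while HRRN switches to the long request once its waiting time dominates its ratio.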