Efficient LLM Scheduling by Learning to Rank
August 28, 2024
Authors: Yichao Fu, Siqi Zhu, Runlong Su, Aurick Qiao, Ion Stoica, Hao Zhang
cs.AI
Abstract
In Large Language Model (LLM) inference, the output length of an LLM request
is typically regarded as not known a priori. Consequently, most LLM serving
systems employ a simple First-come-first-serve (FCFS) scheduling strategy,
leading to Head-Of-Line (HOL) blocking and reduced throughput and service
quality. In this paper, we reexamine this assumption -- we show that, although
predicting the exact generation length of each request is infeasible, it is
possible to predict the relative ranks of output lengths in a batch of
requests, using learning to rank. The ranking information offers valuable
guidance for scheduling requests. Building on this insight, we develop a novel
scheduler for LLM inference and serving that can approximate the
shortest-job-first (SJF) schedule better than existing approaches. We integrate
this scheduler with the state-of-the-art LLM serving system and show
significant performance improvement in several important applications: 2.8x
lower latency in chatbot serving and 6.5x higher throughput in synthetic data
generation. Our code is available at https://github.com/hao-ai-lab/vllm-ltr.git
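To make the core idea concrete, the sketch below shows how a ranking-based scheduler can approximate shortest-job-first without knowing exact output lengths: requests are ordered by a predicted rank score (lower means a shorter predicted output) and served in that order. This is only a minimal illustration under assumed names; `Request`, `predicted_rank_score`, and `schedule_batch` are hypothetical and do not reflect the actual vllm-ltr implementation, where scores would come from a trained learning-to-rank model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    request_id: int
    prompt: str
    # Hypothetical score from a learned ranking model: lower means the
    # request is predicted to produce a shorter output.
    predicted_rank_score: float = 0.0


def schedule_batch(waiting: List[Request], batch_size: int) -> List[Request]:
    """Pick the next batch by sorting on predicted rank scores.

    Serving requests whose outputs are ranked shorter first approximates
    shortest-job-first (SJF) scheduling, without requiring the exact
    generation length of any request.
    """
    ordered = sorted(waiting, key=lambda r: r.predicted_rank_score)
    return ordered[:batch_size]


# Usage: in practice the scores would be produced by a learning-to-rank
# predictor over the batch; here they are hard-coded for illustration only.
if __name__ == "__main__":
    pool = [
        Request(0, "Write a long essay on ...", predicted_rank_score=0.9),
        Request(1, "Translate 'hello'", predicted_rank_score=0.1),
        Request(2, "Summarize this paragraph ...", predicted_rank_score=0.5),
    ]
    for req in schedule_batch(pool, batch_size=2):
        print(req.request_id, req.prompt)
```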