Efficient LLM Scheduling by Learning to Rank
August 28, 2024
Authors: Yichao Fu, Siqi Zhu, Runlong Su, Aurick Qiao, Ion Stoica, Hao Zhang
cs.AI
Abstract
In Large Language Model (LLM) inference, the output length of an LLM request
is typically regarded as not known a priori. Consequently, most LLM serving
systems employ a simple First-come-first-serve (FCFS) scheduling strategy,
leading to Head-Of-Line (HOL) blocking and reduced throughput and service
quality. In this paper, we reexamine this assumption -- we show that, although
predicting the exact generation length of each request is infeasible, it is
possible to predict the relative ranks of output lengths in a batch of
requests, using learning to rank. The ranking information offers valuable
guidance for scheduling requests. Building on this insight, we develop a novel
scheduler for LLM inference and serving that can approximate the
shortest-job-first (SJF) schedule better than existing approaches. We integrate
this scheduler with the state-of-the-art LLM serving system and show
significant performance improvement in several important applications: 2.8x
lower latency in chatbot serving and 6.5x higher throughput in synthetic data
generation. Our code is available at https://github.com/hao-ai-lab/vllm-ltr.git
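To make the core idea concrete, the sketch below shows how a ranking-based scheduler can approximate shortest-job-first without knowing exact output lengths: requests are ordered by a predicted rank score (lower means a shorter predicted output) and served in that order. This is only a minimal illustration under assumed names; `Request`, `predicted_rank_score`, and `schedule_batch` are hypothetical and do not reflect the actual vllm-ltr implementation, where scores would come from a trained learning-to-rank model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    request_id: int
    prompt: str
    # Hypothetical score from a learned ranking model: lower means the
    # request is predicted to produce a shorter output.
    predicted_rank_score: float = 0.0


def schedule_batch(waiting: List[Request], batch_size: int) -> List[Request]:
    """Pick the next batch by sorting on predicted rank scores.

    Serving requests whose outputs are ranked shorter first approximates
    shortest-job-first (SJF) scheduling, without requiring the exact
    generation length of any request.
    """
    ordered = sorted(waiting, key=lambda r: r.predicted_rank_score)
    return ordered[:batch_size]


# Usage: in practice the scores would be produced by a learning-to-rank
# predictor over the batch; here they are hard-coded for illustration only.
if __name__ == "__main__":
    pool = [
        Request(0, "Write a long essay on ...", predicted_rank_score=0.9),
        Request(1, "Translate 'hello'", predicted_rank_score=0.1),
        Request(2, "Summarize this paragraph ...", predicted_rank_score=0.5),
    ]
    for req in schedule_batch(pool, batch_size=2):
        print(req.request_id, req.prompt)
```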