

Efficient LLM Scheduling by Learning to Rank

August 28, 2024
Authors: Yichao Fu, Siqi Zhu, Runlong Su, Aurick Qiao, Ion Stoica, Hao Zhang
cs.AI

Abstract

In Large Language Model (LLM) inference, the output length of an LLM request is typically regarded as not known a priori. Consequently, most LLM serving systems employ a simple First-come-first-serve (FCFS) scheduling strategy, leading to Head-Of-Line (HOL) blocking and reduced throughput and service quality. In this paper, we reexamine this assumption -- we show that, although predicting the exact generation length of each request is infeasible, it is possible to predict the relative ranks of output lengths in a batch of requests, using learning to rank. The ranking information offers valuable guidance for scheduling requests. Building on this insight, we develop a novel scheduler for LLM inference and serving that can approximate the shortest-job-first (SJF) schedule better than existing approaches. We integrate this scheduler with the state-of-the-art LLM serving system and show significant performance improvement in several important applications: 2.8x lower latency in chatbot serving and 6.5x higher throughput in synthetic data generation. Our code is available at https://github.com/hao-ai-lab/vllm-ltr.git
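To make the core idea concrete, below is a minimal sketch (not the paper's implementation) of how predicted relative ranks of output lengths can drive an approximate shortest-job-first schedule. The names `Request`, `predict_rank_score`, and `schedule_batch` are hypothetical; a real system such as vllm-ltr would obtain scores from a learned ranking model rather than the placeholder heuristic used here.

```python
# Sketch: approximate SJF scheduling from predicted rank scores instead of FCFS.
# Assumption: a learned ranker assigns each request a score whose *ordering*
# within a batch matches the ordering of true output lengths.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Request:
    request_id: str
    prompt: str
    rank_score: float = field(default=0.0)  # lower score ~ shorter predicted output


def predict_rank_score(prompt: str) -> float:
    """Hypothetical ranking model.

    Placeholder heuristic for illustration only; the paper trains a ranker
    (learning to rank) to predict relative output-length order, not exact lengths.
    """
    return float(len(prompt))


def schedule_batch(waiting: List[Request]) -> List[Request]:
    """Order waiting requests by predicted rank, approximating shortest-job-first
    and reducing head-of-line blocking compared with first-come-first-serve."""
    for req in waiting:
        req.rank_score = predict_rank_score(req.prompt)
    return sorted(waiting, key=lambda r: r.rank_score)


if __name__ == "__main__":
    batch = [
        Request("a", "Write a 2000-word essay on scheduling."),
        Request("b", "Say hi."),
        Request("c", "Summarize this paragraph in one sentence."),
    ]
    for req in schedule_batch(batch):
        print(req.request_id, req.rank_score)
```

Only the relative order of the scores matters here, which is why a ranking objective suffices even when exact generation lengths cannot be predicted.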
