

DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference

January 9, 2024
Authors: Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, Yuxiong He
cs.AI

Abstract

The deployment and scaling of large language models (LLMs) have become critical as they permeate various applications, demanding high-throughput and low-latency serving systems. Existing frameworks struggle to balance these requirements, especially for workloads with long prompts. This paper introduces DeepSpeed-FastGen, a system that employs Dynamic SplitFuse, a novel prompt and generation composition strategy, to deliver up to 2.3x higher effective throughput, 2x lower latency on average, and up to 3.7x lower (token-level) tail latency, compared to state-of-the-art systems like vLLM. We leverage a synergistic combination of DeepSpeed-MII and DeepSpeed-Inference to provide an efficient and easy-to-use serving system for LLMs. DeepSpeed-FastGen's advanced implementation supports a range of models and offers both non-persistent and persistent deployment options, catering to diverse user scenarios from interactive sessions to long-running applications. We present a detailed benchmarking methodology, analyze the performance through latency-throughput curves, and investigate scalability via load balancing. Our evaluations demonstrate substantial improvements in throughput and latency across various models and hardware configurations. We discuss our roadmap for future enhancements, including broader model support and new hardware backends. The DeepSpeed-FastGen code is readily available for community engagement and contribution.
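The core idea behind Dynamic SplitFuse is to decompose long prompts into smaller chunks processed across multiple forward passes, and to fuse those chunks with the single-token decode steps of ongoing generations, so that every forward pass runs at a near-constant token budget. The sketch below illustrates that scheduling idea only; it is not DeepSpeed's implementation, and the names `TOKEN_BUDGET`, `Request`, and `schedule_forward_pass` are hypothetical.

```python
from dataclasses import dataclass

TOKEN_BUDGET = 2048  # hypothetical per-forward-pass token budget


@dataclass
class Request:
    prompt_tokens: int       # prompt tokens not yet processed
    in_decode: bool = False  # True once the full prompt has been consumed


def schedule_forward_pass(requests: list[Request]) -> list[tuple[Request, int]]:
    """Compose one forward pass under a fixed token budget.

    Decode-phase requests contribute one token each; the remaining
    budget is filled by splitting pending prompts into chunks, so
    successive passes run at a similar total token count.
    """
    batch: list[tuple[Request, int]] = []
    budget = TOKEN_BUDGET

    # 1. Ongoing generations come first: one new token per request.
    for req in requests:
        if req.in_decode and budget > 0:
            batch.append((req, 1))
            budget -= 1

    # 2. Fuse prompt chunks into the leftover budget (the "split" step).
    for req in requests:
        if not req.in_decode and budget > 0:
            chunk = min(req.prompt_tokens, budget)
            batch.append((req, chunk))
            req.prompt_tokens -= chunk
            budget -= chunk
            if req.prompt_tokens == 0:
                req.in_decode = True  # it generates tokens on the next pass

    return batch
```

Keeping the per-pass token count roughly constant is what smooths token-level tail latency: a long incoming prompt no longer stalls every in-flight generation for an entire monolithic prefill pass.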
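The two deployment options map onto DeepSpeed-MII's two entry points: a non-persistent pipeline that lives only for the duration of the calling script, and a persistent deployment that starts a long-running server which other processes can connect to. The snippet below follows the usage shown in the DeepSpeed-MII documentation at the time of writing; the model name is just an example, and exact signatures may change between releases.

```python
import mii

# Non-persistent: the engine exists only inside this Python process,
# suited to interactive sessions and one-off scripts.
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
responses = pipe(["DeepSpeed is"], max_new_tokens=128)
print(responses)

# Persistent: starts a long-running inference server for this model,
# suited to serving workloads. A separate process can reconnect with
# mii.client("mistralai/Mistral-7B-v0.1").
client = mii.serve("mistralai/Mistral-7B-v0.1")
output = client.generate(["DeepSpeed is"], max_new_tokens=128)
print(output)
```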