DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference
January 9, 2024
Authors: Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, Yuxiong He
cs.AI
Abstract
The deployment and scaling of large language models (LLMs) have become
critical as they permeate various applications, demanding high-throughput and
low-latency serving systems. Existing frameworks struggle to balance these
requirements, especially for workloads with long prompts. This paper introduces
DeepSpeed-FastGen, a system that employs Dynamic SplitFuse, a novel prompt and
generation composition strategy, to deliver up to 2.3x higher effective
throughput, 2x lower latency on average, and up to 3.7x lower (token-level)
tail latency, compared to state-of-the-art systems like vLLM. We leverage a
synergistic combination of DeepSpeed-MII and DeepSpeed-Inference to provide an
efficient and easy-to-use serving system for LLMs. DeepSpeed-FastGen's advanced
implementation supports a range of models and offers both non-persistent and
persistent deployment options, catering to diverse user scenarios from
interactive sessions to long-running applications. We present a detailed
benchmarking methodology, analyze the performance through latency-throughput
curves, and investigate scalability via load balancing. Our evaluations
demonstrate substantial improvements in throughput and latency across various
models and hardware configurations. We discuss our roadmap for future
enhancements, including broader model support and new hardware backends. The
DeepSpeed-FastGen code is readily available for community engagement and
contribution.
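
The abstract only names Dynamic SplitFuse; the core idea, per the paper, is that long prompts are decomposed into chunks processed across multiple forward passes, while short prompts are composed together with in-flight generation tokens so that every pass runs at a roughly fixed token budget. The toy Python sketch below illustrates that scheduling idea only; it is not the DeepSpeed implementation, and the names TOKEN_BUDGET, Request, and schedule_step are illustrative.

```python
from dataclasses import dataclass

# Illustrative per-pass token budget; the real value is tuned per model/hardware.
TOKEN_BUDGET = 512

@dataclass
class Request:
    rid: int
    remaining_prompt: int   # prompt tokens not yet processed
    decoding: bool = False  # True once the prompt is fully consumed

def schedule_step(queue):
    """Compose one forward pass: fuse decode tokens and prompt chunks
    until the fixed token budget is filled (toy Dynamic SplitFuse)."""
    budget = TOKEN_BUDGET
    batch = []
    # 1) Every in-flight decoding request contributes exactly one token.
    for r in queue:
        if r.decoding and budget > 0:
            batch.append((r.rid, "decode", 1))
            budget -= 1
    # 2) Fill the remaining budget with prompt chunks; long prompts are split.
    for r in queue:
        if not r.decoding and budget > 0:
            chunk = min(r.remaining_prompt, budget)
            batch.append((r.rid, "prefill", chunk))
            r.remaining_prompt -= chunk
            budget -= chunk
            if r.remaining_prompt == 0:
                r.decoding = True
    return batch

# Example: a 1200-token prompt is split across passes while a short prompt
# and an already-decoding request keep the budget filled every step.
queue = [Request(0, 1200), Request(1, 100), Request(2, 0, decoding=True)]
for step in range(3):
    print(step, schedule_step(queue))
```

Running every pass at a fixed, well-chosen token budget keeps the forward pass in the hardware's high-throughput regime and avoids the token-level stalls that a monolithic prefill of a long prompt would impose on co-scheduled decode requests, which is the mechanism behind the tail-latency improvements the abstract reports.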
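The abstract also mentions non-persistent and persistent deployment options. Below is a minimal sketch of both modes using the public DeepSpeed-MII API (mii.pipeline, mii.serve, mii.client), as documented in the DeepSpeed-MII README; the model name and generation parameters are placeholders, and exact signatures may vary across MII releases.

```python
import mii

# Non-persistent: the inference engine lives only as long as this process.
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")  # placeholder model name
print(pipe(["DeepSpeed is", "Seattle is"], max_new_tokens=128))

# Persistent: a long-lived server that multiple clients can query.
client = mii.serve("mistralai/Mistral-7B-v0.1")
print(client.generate(["DeepSpeed is"], max_new_tokens=128))

# From another process, attach to the already-running deployment:
# client = mii.client("mistralai/Mistral-7B-v0.1")

client.terminate_server()  # shut the persistent deployment down
```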