DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference
January 9, 2024
Authors: Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, Yuxiong He
cs.AI
Abstract
The deployment and scaling of large language models (LLMs) have become
critical as they permeate various applications, demanding high-throughput and
low-latency serving systems. Existing frameworks struggle to balance these
requirements, especially for workloads with long prompts. This paper introduces
DeepSpeed-FastGen, a system that employs Dynamic SplitFuse, a novel prompt and
generation composition strategy, to deliver up to 2.3x higher effective
throughput, 2x lower latency on average, and up to 3.7x lower (token-level)
tail latency, compared to state-of-the-art systems like vLLM. We leverage a
synergistic combination of DeepSpeed-MII and DeepSpeed-Inference to provide an
efficient and easy-to-use serving system for LLMs. DeepSpeed-FastGen's advanced
implementation supports a range of models and offers both non-persistent and
persistent deployment options, catering to diverse user scenarios from
interactive sessions to long-running applications. We present a detailed
benchmarking methodology, analyze the performance through latency-throughput
curves, and investigate scalability via load balancing. Our evaluations
demonstrate substantial improvements in throughput and latency across various
models and hardware configurations. We discuss our roadmap for future
enhancements, including broader model support and new hardware backends. The
DeepSpeed-FastGen code is readily available for community engagement and
contribution.
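
The abstract only names Dynamic SplitFuse; the core idea, per the paper, is that long prompts are decomposed into chunks processed across multiple forward passes, while short prompts are composed together with in-flight generation tokens so that every pass runs at a roughly fixed token budget. The toy Python sketch below illustrates that scheduling idea only; it is not the DeepSpeed implementation, and the names TOKEN_BUDGET, Request, and schedule_step are illustrative.

```python
from dataclasses import dataclass

# Illustrative per-pass token budget; the real value is tuned per model/hardware.
TOKEN_BUDGET = 512

@dataclass
class Request:
    rid: int
    remaining_prompt: int   # prompt tokens not yet processed
    decoding: bool = False  # True once the prompt is fully consumed

def schedule_step(queue):
    """Compose one forward pass: fuse decode tokens and prompt chunks
    until the fixed token budget is filled (toy Dynamic SplitFuse)."""
    budget = TOKEN_BUDGET
    batch = []
    # 1) Every in-flight decoding request contributes exactly one token.
    for r in queue:
        if r.decoding and budget > 0:
            batch.append((r.rid, "decode", 1))
            budget -= 1
    # 2) Fill the remaining budget with prompt chunks; long prompts are split.
    for r in queue:
        if not r.decoding and budget > 0:
            chunk = min(r.remaining_prompt, budget)
            batch.append((r.rid, "prefill", chunk))
            r.remaining_prompt -= chunk
            budget -= chunk
            if r.remaining_prompt == 0:
                r.decoding = True
    return batch

# Example: a 1200-token prompt is split across passes while a short prompt
# and an already-decoding request keep the budget filled every step.
queue = [Request(0, 1200), Request(1, 100), Request(2, 0, decoding=True)]
for step in range(3):
    print(step, schedule_step(queue))
```

Running every pass at a fixed, well-chosen token budget keeps the forward pass in the hardware's high-throughput regime and avoids the token-level stalls that a monolithic prefill of a long prompt would impose on co-scheduled decode requests, which is the mechanism behind the tail-latency improvements the abstract reports.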
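The abstract also mentions non-persistent and persistent deployment options. Below is a minimal sketch of both modes using the public DeepSpeed-MII API (mii.pipeline, mii.serve, mii.client), as documented in the DeepSpeed-MII README; the model name and generation parameters are placeholders, and exact signatures may vary across MII releases.

```python
import mii

# Non-persistent: the inference engine lives only as long as this process.
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")  # placeholder model name
print(pipe(["DeepSpeed is", "Seattle is"], max_new_tokens=128))

# Persistent: a long-lived server that multiple clients can query.
client = mii.serve("mistralai/Mistral-7B-v0.1")
print(client.generate(["DeepSpeed is"], max_new_tokens=128))

# From another process, attach to the already-running deployment:
# client = mii.client("mistralai/Mistral-7B-v0.1")

client.terminate_server()  # shut the persistent deployment down
```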