The Impact of Hyperparameters on Large Language Model Inference Performance: An Evaluation of vLLM and HuggingFace Pipelines
August 2, 2024
Author: Matias Martinez
cs.AI
Abstract
The recent surge of open-source large language models (LLMs) enables
developers to create AI-based solutions while maintaining control over aspects
such as privacy and compliance, thereby providing governance and ownership of
the model deployment process. To utilize these LLMs, inference engines are
needed. These engines load the model's weights onto available resources, such
as GPUs, and process queries to generate responses. The inference speed, or
performance, of an LLM is critical for real-time applications, since each
inference involves millions or billions of floating-point operations. Recently,
advanced inference engines such as vLLM have emerged, incorporating novel
mechanisms such as efficient memory management to achieve state-of-the-art
performance. In this paper, we analyze the performance, particularly the
throughput (tokens generated per unit of time), of 20 LLMs using two inference
libraries: vLLM and HuggingFace's pipelines. We investigate how various
hyperparameters, which developers must configure, influence inference
performance. Our results reveal that throughput landscapes are irregular, with
distinct peaks, highlighting the importance of hyperparameter optimization to
achieve maximum performance. We also show that applying hyperparameter
optimization when upgrading or downgrading the GPU model used for inference can
improve throughput from HuggingFace pipelines by an average of 9.16% and 13.7%,
respectively.
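As a rough illustration of the throughput metric used above (tokens generated per unit of time), the sketch below times a generation call and computes tokens per second. The pipeline invocation shown in the comment is an illustrative assumption about how such a measurement could be taken with HuggingFace's pipelines, not the paper's exact benchmarking setup; the model name and token counts are made up.

```python
import time

def tokens_per_second(num_new_tokens: int, elapsed_seconds: float) -> float:
    """Throughput metric from the paper: tokens generated per unit of time."""
    return num_new_tokens / elapsed_seconds

# Hypothetical timing of a HuggingFace pipeline call (requires `transformers`
# and a GPU; model name is illustrative):
#
#   from transformers import pipeline
#   generator = pipeline("text-generation", model="gpt2", device=0)
#   start = time.perf_counter()
#   generator("Hello", max_new_tokens=128, return_full_text=False)
#   elapsed = time.perf_counter() - start
#   print(tokens_per_second(128, elapsed))

# Self-contained example with made-up numbers:
print(tokens_per_second(512, 4.0))  # 512 tokens in 4 s -> 128.0 tokens/s
```

In practice, the hyperparameters the paper studies (e.g., batch size or number of parallel requests) would be varied around such a measurement loop to map out the throughput landscape.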