
The Impact of Hyperparameters on Large Language Model Inference Performance: An Evaluation of vLLM and HuggingFace Pipelines

August 2, 2024
Author: Matias Martinez
cs.AI

Abstract

The recent surge of open-source large language models (LLMs) enables developers to create AI-based solutions while maintaining control over aspects such as privacy and compliance, thereby providing governance and ownership of the model deployment process. To utilize these LLMs, inference engines are needed. These engines load the model's weights onto the available resources, such as GPUs, and process queries to generate responses. The inference speed, or performance, of an LLM is critical for real-time applications, since each inference computes millions or billions of floating-point operations. Recently, advanced inference engines such as vLLM have emerged, incorporating novel mechanisms such as efficient memory management to achieve state-of-the-art performance. In this paper, we analyze the performance, particularly the throughput (tokens generated per unit of time), of 20 LLMs using two inference libraries: vLLM and HuggingFace's pipelines. We investigate how various hyperparameters, which developers must configure, influence inference performance. Our results reveal that throughput landscapes are irregular, with distinct peaks, highlighting the importance of hyperparameter optimization for achieving maximum performance. We also show that applying hyperparameter optimization when upgrading or downgrading the GPU model used for inference can improve the throughput of HuggingFace pipelines by an average of 9.16% and 13.7%, respectively.
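
To make the measured quantity concrete, below is a minimal sketch of how throughput (tokens generated per unit of time) could be benchmarked with the two libraries the paper evaluates, while sweeping one developer-configured hyperparameter (the pipeline batch size). The model name, prompt set, token budget, and batch-size grid here are illustrative assumptions, not the paper's actual experimental setup.

```python
import time

from transformers import pipeline
from vllm import LLM, SamplingParams

# Illustrative assumptions: this model, prompt set, and batch-size grid are
# placeholders, not the configuration used in the paper's experiments.
MODEL = "facebook/opt-1.3b"
PROMPTS = ["Summarize the benefits of efficient GPU memory management."] * 32
MAX_NEW_TOKENS = 128


def hf_throughput(batch_size: int) -> float:
    """Tokens/second for a HuggingFace text-generation pipeline at one batch size."""
    generator = pipeline("text-generation", model=MODEL, device=0)
    start = time.perf_counter()
    outputs = generator(
        PROMPTS,
        max_new_tokens=MAX_NEW_TOKENS,
        batch_size=batch_size,
        return_full_text=False,  # count only newly generated text, not the prompt
    )
    elapsed = time.perf_counter() - start
    tok = generator.tokenizer
    # With a list of prompts, the pipeline returns one list of completions per prompt.
    n_tokens = sum(len(tok(o[0]["generated_text"])["input_ids"]) for o in outputs)
    return n_tokens / elapsed


def vllm_throughput() -> float:
    """Tokens/second for vLLM; its continuous batching schedules requests internally."""
    llm = LLM(model=MODEL)
    params = SamplingParams(max_tokens=MAX_NEW_TOKENS)
    start = time.perf_counter()
    outputs = llm.generate(PROMPTS, params)
    elapsed = time.perf_counter() - start
    n_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
    return n_tokens / elapsed


if __name__ == "__main__":
    # A coarse sweep over a single hyperparameter. The paper's finding is that
    # the throughput landscape over such parameters is irregular with distinct
    # peaks, so the best setting must be found empirically per GPU and model.
    for bs in (1, 4, 8, 16, 32):
        print(f"HF pipeline, batch_size={bs}: {hf_throughput(bs):.1f} tok/s")
    print(f"vLLM: {vllm_throughput():.1f} tok/s")
```

Note that vLLM exposes no equivalent batch-size knob in this loop: its scheduler batches requests continuously, which is one of the memory-management mechanisms the abstract credits for its performance. A fuller sweep in the paper's spirit would also vary parameters such as the number of prompts in flight and repeat the sweep after changing the GPU model.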