The Impact of Hyperparameters on Large Language Model Inference Performance: An Evaluation of vLLM and HuggingFace Pipelines
August 2, 2024
Author: Matias Martinez
cs.AI
Abstract
The recent surge of open-source large language models (LLMs) enables
developers to create AI-based solutions while maintaining control over aspects
such as privacy and compliance, thereby providing governance and ownership of
the model deployment process. To utilize these LLMs, inference engines are
needed. These engines load the model's weights onto available resources, such
as GPUs, and process queries to generate responses. The inference speed, or
performance, of an LLM is critical for real-time applications, as each
inference computes millions or billions of floating-point operations. Recently,
advanced inference engines such as vLLM have emerged, incorporating novel
mechanisms such as efficient memory management to achieve state-of-the-art
performance. In this paper, we analyze the performance, particularly the
throughput (tokens generated per unit of time), of 20 LLMs using two inference
libraries: vLLM and HuggingFace's pipelines. We investigate how various
hyperparameters, which developers must configure, influence inference
performance. Our results reveal that throughput landscapes are irregular, with
distinct peaks, highlighting the importance of hyperparameter optimization to
achieve maximum performance. We also show that applying hyperparameter
optimization when upgrading or downgrading the GPU model used for inference can
improve throughput from HuggingFace pipelines by an average of 9.16% and 13.7%,
respectively.
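The throughput metric used in the paper (tokens generated per unit of time) can be sketched with a small timing helper. This is a minimal illustration, not the authors' benchmarking code; `fake_generate` is a hypothetical stand-in for a real HuggingFace `pipeline` or vLLM generation call, which would return the generated text and the number of new tokens.

```python
import time

def measure_throughput(generate_fn, prompt: str):
    """Time one generation call and return (text, tokens per second).

    generate_fn(prompt) must return (generated_text, num_new_tokens);
    in practice this would wrap a HuggingFace pipeline or vLLM call.
    """
    start = time.perf_counter()
    text, n_tokens = generate_fn(prompt)
    elapsed = time.perf_counter() - start
    return text, n_tokens / elapsed

# Hypothetical stand-in for a real model call (assumption, for illustration).
def fake_generate(prompt: str):
    time.sleep(0.01)  # simulate model latency
    out = prompt + " world"
    return out, len(out.split())

text, tokens_per_second = measure_throughput(fake_generate, "hello")
```

In a real evaluation, the hyperparameters the paper varies (e.g. batch size for HuggingFace pipelines) would be set on the wrapped generation call, and `measure_throughput` would be run across prompts to map the throughput landscape.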