The Impact of Hyperparameters on Large Language Model Inference Performance: An Evaluation of vLLM and HuggingFace Pipelines

Abstract

The recent surge of open-source large language models (LLMs) lets developers build AI-based solutions while retaining control over aspects such as privacy and compliance, giving them governance and ownership of the model deployment process. Using these LLMs requires an inference engine, which loads the model's weights onto available resources, such as GPUs, and processes queries to generate responses. Inference speed, i.e., the performance of the LLM, is critical for real-time applications, because each inference involves millions or billions of floating-point operations. Recently, advanced inference engines such as vLLM have emerged, incorporating novel mechanisms such as efficient memory management to achieve state-of-the-art performance. In this paper, we analyze the performance, in particular the throughput (tokens generated per unit of time), of 20 LLMs using two inference libraries: vLLM and HuggingFace's pipelines. We investigate how the various hyperparameters that developers must configure influence inference performance. Our results reveal that throughput landscapes are irregular, with distinct peaks, which highlights the importance of hyperparameter optimization for achieving maximum performance. We also show that applying hyperparameter optimization when upgrading or downgrading the GPU model used for inference can improve the throughput of HuggingFace pipelines by an average of 9.16% and 13.7%, respectively.
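As a rough illustration of the quantities named in the abstract, the sketch below computes throughput as tokens per second and runs an exhaustive search over a small hyperparameter grid. It is not the paper's benchmarking code: the hyperparameter name `batch_size`, the helper names, and the numbers in `TOY_TOKENS_IN_2S` are all hypothetical placeholders chosen only to produce a non-monotonic landscape. In a real study, `toy_measure` would be replaced by a timed generation run with vLLM or a HuggingFace pipeline.

```python
import itertools
from typing import Callable, Dict, List, Sequence, Tuple

Config = Dict[str, int]


def throughput(num_tokens: int, elapsed_seconds: float) -> float:
    """Throughput as defined in the abstract: tokens generated per unit of time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_tokens / elapsed_seconds


def grid_search(
    measure: Callable[[Config], float],
    grid: Dict[str, Sequence[int]],
) -> Tuple[Config, float, List[Tuple[Config, float]]]:
    """Measure every configuration in the grid and return the best one."""
    names = list(grid)
    results: List[Tuple[Config, float]] = []
    for values in itertools.product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        results.append((config, measure(config)))
    best_config, best_value = max(results, key=lambda item: item[1])
    return best_config, best_value, results


# Hypothetical token counts for a fixed 2-second window, keyed by batch size.
# These are made-up numbers for illustration, not measurements from the paper.
TOY_TOKENS_IN_2S = {1: 120, 8: 640, 16: 910, 32: 760, 64: 980}


def toy_measure(config: Config) -> float:
    """Stand-in for a real benchmark run (assumption: one timed run per config)."""
    return throughput(TOY_TOKENS_IN_2S[config["batch_size"]], 2.0)


best_config, best_value, results = grid_search(
    toy_measure, {"batch_size": [1, 8, 16, 32, 64]}
)
print(best_config, best_value)
```

Because the toy landscape dips at one intermediate batch size before rising again, a search that stops at the first local peak would miss the best configuration. That is the kind of irregularity that motivates searching the hyperparameter space rather than tuning by hand.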
