The Impact of Hyperparameters on Large Language Model Inference Performance: An Evaluation of vLLM and HuggingFace Pipelines

Abstract

The recent surge of open-source large language models (LLMs) enables developers to build AI-based solutions while retaining control over aspects such as privacy and compliance, giving them governance and ownership of the model deployment process. Using these LLMs requires an inference engine, which loads the model's weights onto available resources, such as GPUs, and processes queries to generate responses. Inference speed, or performance, is critical for real-time applications, because each inference computes millions or billions of floating-point operations. Recently, advanced inference engines such as vLLM have emerged, incorporating novel mechanisms such as efficient memory management to achieve state-of-the-art performance. In this paper, we analyze the performance, particularly the throughput (tokens generated per unit of time), of 20 LLMs using two inference libraries: vLLM and HuggingFace's pipelines. We investigate how various hyperparameters, which developers must configure, influence inference performance. Our results reveal that throughput landscapes are irregular, with distinct peaks, highlighting the importance of hyperparameter optimization for achieving maximum performance. We also show that applying hyperparameter optimization when upgrading or downgrading the GPU model used for inference improves the throughput of HuggingFace pipelines by an average of 9.16% and 13.7%, respectively.
