Inference Performance Optimization for Large Language Models on CPUs

Abstract

Large language models (LLMs) have shown exceptional performance and vast potential across diverse tasks. Deploying high-performance LLMs in low-resource environments has therefore garnered significant attention in the industry. When GPU hardware resources are limited, CPUs offer an alternative. To mitigate the financial burden and ease the constraints imposed by hardware resources, inference performance must be optimized. In this paper, we introduce an easily deployable inference performance optimization solution for accelerating LLMs on CPUs. In this solution, we implement an effective way to reduce the KV cache size while preserving precision. We propose a distributed inference optimization approach and implement it on top of the oneAPI Collective Communications Library. Furthermore, we propose optimization approaches for LLMs on CPUs and conduct tailored optimizations for the most commonly used models. The code is open-sourced at https://github.com/intel/xFasterTransformer.
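
To make the KV cache claim concrete, the sketch below estimates the cache footprint and shows one common way to shrink it: symmetric INT8 quantization. The abstract does not state which method the authors use, so the quantization scheme, the function names, and the model configuration here are all illustrative assumptions, not the paper's implementation.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem):
    """Approximate KV cache size in bytes (ignores any quantization scales)."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


def quantize_int8(values):
    """Symmetric INT8 quantization using a single scale for the whole vector."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Clamp to [-127, 127] so the result always fits in a signed 8-bit integer.
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate floating-point values from INT8 codes."""
    return [x * scale for x in q]


# Hypothetical 7B-class configuration (illustrative numbers, not taken from the paper).
cfg = dict(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=2048, batch_size=1)
fp16_bytes = kv_cache_bytes(**cfg, bytes_per_elem=2)
int8_bytes = kv_cache_bytes(**cfg, bytes_per_elem=1)

# Round-trip a small vector to see the quantization error.
original = [0.5, -1.0, 0.25]
codes, scale = quantize_int8(original)
restored = dequantize_int8(codes, scale)

if __name__ == "__main__":
    print(f"FP16 KV cache: {fp16_bytes / 2**20:.0f} MiB")
    print(f"INT8 KV cache: {int8_bytes / 2**20:.0f} MiB")
    print("restored:", restored)
```

Under these assumptions, storing one byte per element instead of two halves the cache, and the round-trip error of each value is bounded by half the scale. A real implementation would also need to store the scales and choose their granularity (for example per token or per head), which this sketch leaves out.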
