Inference Performance Optimization for Large Language Models on CPUs

Abstract

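Large language models (LLMs) have shown exceptional performance and vast potential across diverse tasks. Deploying high-performance LLMs in low-resource environments, however, is a challenge that has garnered significant attention in the industry. When GPU hardware resources are limited, CPUs offer an alternative. Optimizing inference performance is necessary to reduce the financial burden and ease the constraints imposed by hardware resources. In this paper, we introduce an easily deployable inference performance optimization solution for accelerating LLMs on CPUs. The solution implements an effective method for reducing the KV cache size while preserving precision. We propose a distributed inference optimization approach and implement it on top of the oneAPI Collective Communications Library. Furthermore, we propose optimization approaches for LLMs on CPUs and apply tailored optimizations to the most commonly used models. The code is open-sourced at https://github.com/intel/xFasterTransformer.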


