Inference Performance Optimization for Large Language Models on CPUs

July 10, 2024
Authors: Pujiang He, Shan Zhou, Wenhuan Huang, Changqing Li, Duyi Wang, Bin Guo, Chen Meng, Sheng Gui, Weifei Yu, Yi Xie
cs.AI

Abstract

Large language models (LLMs) have shown exceptional performance and vast potential across diverse tasks. However, the deployment of LLMs with high performance in low-resource environments has garnered significant attention in the industry. When GPU hardware resources are limited, we can explore alternative options on CPUs. To mitigate the financial burden and alleviate constraints imposed by hardware resources, optimizing inference performance is necessary. In this paper, we introduce an easily deployable inference performance optimization solution aimed at accelerating LLMs on CPUs. In this solution, we implement an effective way to reduce the KV cache size while ensuring precision. We propose a distributed inference optimization approach and implement it based on oneAPI Collective Communications Library. Furthermore, we propose optimization approaches for LLMs on CPU, and conduct tailored optimizations for the most commonly used models. The code is open-sourced at https://github.com/intel/xFasterTransformer.
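
The abstract does not spell out how the KV cache size is reduced; a common approach that fits the "smaller cache while ensuring precision" description is storing K/V activations in a lower-precision format such as int8 with a per-row scale. The sketch below is only an illustration of that generic technique, not the paper's actual implementation; all names in it are hypothetical.

```cpp
// Hypothetical sketch: int8 KV-cache storage with per-row scales.
// Halves the footprint vs. fp16 (quarters it vs. fp32) while keeping
// values recoverable with small quantization error.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Quantize one head-dimension row of a K or V tensor to int8 plus a scale.
void quantize_row(const std::vector<float>& row, std::vector<int8_t>& q, float& scale) {
    float max_abs = 0.f;
    for (float v : row) max_abs = std::max(max_abs, std::fabs(v));
    scale = max_abs > 0.f ? max_abs / 127.f : 1.f;
    q.resize(row.size());
    for (size_t i = 0; i < row.size(); ++i)
        q[i] = static_cast<int8_t>(std::lround(row[i] / scale));
}

// Dequantize the row when it is read back during attention.
void dequantize_row(const std::vector<int8_t>& q, float scale, std::vector<float>& row) {
    row.resize(q.size());
    for (size_t i = 0; i < q.size(); ++i)
        row[i] = q[i] * scale;
}
```

For the distributed inference piece, the abstract only states that the optimization is built on the oneAPI Collective Communications Library (oneCCL). The following sketch shows the standard oneCCL host-side allreduce pattern (rank/kvs exchange via MPI), as one would use to sum partial results of a tensor-parallel layer; the buffer contents and sizes are illustrative assumptions, not details from the paper.

```cpp
// Sketch of reducing tensor-parallel partial results with oneCCL allreduce,
// following the library's standard CPU sample structure.
#include <mpi.h>
#include <vector>
#include "oneapi/ccl.hpp"

int main(int argc, char* argv[]) {
    ccl::init();
    MPI_Init(&argc, &argv);
    int size = 0, rank = 0;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Exchange the main key-value store address so all ranks join one communicator.
    ccl::shared_ptr_class<ccl::kvs> kvs;
    ccl::kvs::address_type main_addr;
    if (rank == 0) {
        kvs = ccl::create_main_kvs();
        main_addr = kvs->get_address();
        MPI_Bcast((void*)main_addr.data(), main_addr.size(), MPI_BYTE, 0, MPI_COMM_WORLD);
    } else {
        MPI_Bcast((void*)main_addr.data(), main_addr.size(), MPI_BYTE, 0, MPI_COMM_WORLD);
        kvs = ccl::create_kvs(main_addr);
    }
    auto comm = ccl::create_communicator(size, rank, kvs);

    // Each rank holds a partial output of a tensor-parallel matmul; sum them.
    const size_t count = 4096;  // illustrative element count
    std::vector<float> partial(count, 1.0f * rank), reduced(count);
    ccl::allreduce(partial.data(), reduced.data(), count,
                   ccl::reduction::sum, comm).wait();

    MPI_Finalize();
    return 0;
}
```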
