A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and Efficiency

May 3, 2025
作者: Sihyeong Park, Sungryeol Jeon, Chaelyn Lee, Seokhun Jeon, Byung-Soo Kim, Jemin Lee
cs.AI

Abstract

Large language models (LLMs) are widely applied in chatbots, code generators, and search engines. Workloads such as chain-of-thought, complex reasoning, and agent services significantly increase inference cost by invoking the model repeatedly. Optimization methods such as parallelism, compression, and caching have been adopted to reduce costs, but diverse service requirements make it hard to select the right method. Recently, specialized LLM inference engines have emerged as a key component for integrating these optimization methods into service-oriented infrastructures. However, a systematic study of inference engines is still lacking. This paper provides a comprehensive evaluation of 25 open-source and commercial inference engines. We examine each inference engine in terms of ease of use, ease of deployment, general-purpose support, scalability, and suitability for throughput- and latency-aware computation. Furthermore, we explore the design goals of each inference engine by investigating the optimization techniques it supports. In addition, we assess the ecosystem maturity of open-source inference engines and examine the performance and cost policies of commercial solutions. We outline future research directions, including support for complex LLM-based services, support for various hardware, and enhanced security, offering practical guidance to researchers and developers in selecting and designing optimized LLM inference engines. We also provide a public repository to continually track developments in this fast-evolving field: https://github.com/sihyeong/Awesome-LLM-Inference-Engine
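The abstract notes that repeated model invocations (as in chain-of-thought or agent workloads) inflate inference cost, and lists caching among the mitigations. The toy sketch below (our illustration, not from the survey) shows the idea behind prefix caching: requests that share a prompt prefix can reuse earlier computation, so only the new suffix is processed. All names here are hypothetical; `encode_token` stands in for the expensive per-token forward pass of a real model.

```python
calls = 0  # counts simulated per-token computations

def encode_token(state, token):
    """Stand-in for one token's attention/MLP work."""
    global calls
    calls += 1
    return hash((state, token))

prefix_cache = {}

def run_with_prefix_cache(tokens):
    """Reuse the longest cached prefix; compute only the new suffix."""
    state, start = 0, 0
    # Find the longest already-computed prefix of this request.
    for i in range(len(tokens), 0, -1):
        key = tuple(tokens[:i])
        if key in prefix_cache:
            state, start = prefix_cache[key], i
            break
    # Compute (and cache) only the remaining suffix tokens.
    for i in range(start, len(tokens)):
        state = encode_token(state, tokens[i])
        prefix_cache[tuple(tokens[:i + 1])] = state
    return state

# Two requests sharing the prefix ["sys", "hi"]:
run_with_prefix_cache(["sys", "hi", "A"])
first = calls           # all 3 tokens computed
run_with_prefix_cache(["sys", "hi", "B"])
second = calls - first  # only the 1 new suffix token is computed
```

Production engines apply the same principle to attention key/value states (KV caching and prefix sharing) rather than to a toy scalar state, but the cost structure is analogous: shared prefixes are paid for once.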

