

A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and Efficiency

May 3, 2025
Authors: Sihyeong Park, Sungryeol Jeon, Chaelyn Lee, Seokhun Jeon, Byung-Soo Kim, Jemin Lee
cs.AI

Abstract

Large language models (LLMs) are widely applied in chatbots, code generators, and search engines. Workloads such as chain-of-thought, complex reasoning, and agent services significantly increase the inference cost by invoking the model repeatedly. Optimization methods such as parallelism, compression, and caching have been adopted to reduce costs, but the diverse service requirements make it hard to select the right method. Recently, specialized LLM inference engines have emerged as a key component for integrating these optimization methods into service-oriented infrastructures. However, a systematic study of inference engines is still lacking. This paper provides a comprehensive evaluation of 25 open-source and commercial inference engines. We examine each inference engine in terms of ease of use, ease of deployment, general-purpose support, scalability, and suitability for throughput- and latency-aware computation. Furthermore, we explore the design goals of each inference engine by investigating the optimization techniques it supports. In addition, we assess the ecosystem maturity of open-source inference engines and discuss the performance and cost policies of commercial solutions. We outline future research directions, including support for complex LLM-based services, support for diverse hardware, and enhanced security, offering practical guidance to researchers and developers in selecting and designing optimized LLM inference engines. We also provide a public repository to continually track developments in this fast-evolving field: https://github.com/sihyeong/Awesome-LLM-Inference-Engine
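The abstract notes that caching is one of the optimizations these engines integrate. As a minimal, hypothetical sketch (not taken from the survey), the toy model below counts token positions processed during autoregressive decoding as a stand-in for compute: without a KV cache, every decode step re-processes the entire prefix; with one, only the newest token is processed.

```python
# Toy cost model (illustrative only): token positions processed ~ FLOPs.
def decode_cost(prompt_len: int, new_tokens: int, use_kv_cache: bool) -> int:
    """Return the total number of token positions processed while decoding."""
    cost = 0
    seen = prompt_len  # tokens already in the sequence
    for _ in range(new_tokens):
        if use_kv_cache:
            cost += 1      # attend over cached K/V; process only the new token
        else:
            cost += seen   # recompute attention over the whole prefix
        seen += 1
    return cost

baseline = decode_cost(prompt_len=512, new_tokens=128, use_kv_cache=False)
cached = decode_cost(prompt_len=512, new_tokens=128, use_kv_cache=True)
print(baseline, cached)  # → 73664 128
```

The gap grows quadratically with sequence length, which is why KV caching (and cache-management schemes in engines such as vLLM) features so prominently among the optimizations the survey catalogs.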

