Taming the Titans: A Survey of Efficient LLM Inference Serving
April 28, 2025
Authors: Ranran Zhen, Juntao Li, Yixin Ji, Zhenlin Yang, Tong Liu, Qingrong Xia, Xinyu Duan, Zhefeng Wang, Baoxing Huai, Min Zhang
cs.AI
Abstract
Large Language Models (LLMs) for Generative AI have achieved remarkable
progress, evolving into sophisticated and versatile tools widely adopted across
various domains and applications. However, the substantial memory overhead
caused by their vast number of parameters, combined with the high computational
demands of the attention mechanism, poses significant challenges in achieving
low latency and high throughput for LLM inference services. Recent
groundbreaking research has significantly accelerated progress in this field.
This paper provides a comprehensive survey of these
methods, covering fundamental instance-level approaches, in-depth cluster-level
strategies, emerging scenario directions, and other miscellaneous but important
areas. At the instance level, we review model placement, request scheduling,
decoding length prediction, storage management, and the disaggregation
paradigm. At the cluster level, we explore GPU cluster deployment,
multi-instance load balancing, and cloud service solutions. For emerging
scenarios, we organize the discussion around specific tasks, modules, and
auxiliary methods. To ensure a holistic overview, we also highlight several
niche yet critical areas. Finally, we outline potential research directions to
further advance the field of LLM inference serving.
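To make the memory-overhead claim in the abstract concrete, the following
back-of-the-envelope sketch estimates the weight and KV-cache footprint of a
generic 7B-parameter decoder served in fp16. All configuration values (layer
count, head count, batch size, context length) are illustrative assumptions
for a Llama-2-7B-like model, not figures taken from the survey.

    # Illustrative estimate of LLM serving memory: weights + KV cache.
    # Assumed configuration for a generic 7B decoder; not from the paper.
    BYTES_FP16 = 2

    params = 7e9                  # model parameters
    n_layers = 32                 # transformer layers
    n_kv_heads = 32               # key/value heads (no grouped-query attn)
    head_dim = 128                # per-head hidden size

    weight_bytes = params * BYTES_FP16

    # Per generated token, each layer caches one K and one V vector
    # per KV head, so the cache grows linearly with context length.
    kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * BYTES_FP16

    batch, seq_len = 32, 4096     # concurrent requests x context length
    kv_cache_bytes = kv_bytes_per_token * batch * seq_len

    gib = 1024 ** 3
    print(f"weights:  {weight_bytes / gib:5.1f} GiB")    # ~13.0 GiB
    print(f"KV cache: {kv_cache_bytes / gib:5.1f} GiB")  # ~64.0 GiB

Even under these modest assumptions the KV cache dwarfs the model weights,
which is why the storage-management and disaggregation techniques surveyed at
the instance level are central to low-latency, high-throughput serving.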