
Reasoning Language Model Inference Serving Unveiled: An Empirical Study

October 21, 2025
Authors: Qi Li, Junpan Wu, Xiang Liu, Yuxin Wang, Zeyu Li, Zhenheng Tang, Yuhan Chen, Shaohuai Shi, Xiaowen Chu
cs.AI

Abstract

Reasoning large language models (RLLMs) have proven competitive with general LLMs in solving complex reasoning tasks such as mathematics and coding. However, the serving performance and behavior of RLLMs remain largely unexplored, which may hinder their deployment and utilization in real-world scenarios. To close this gap, this paper presents a comprehensive study of RLLM serving. We first perform a pilot study comparing the serving performance of RLLMs and traditional LLMs and reveal several distinct differences in serving behavior: (1) significant and fluctuating memory usage; (2) straggler requests; (3) adaptive running time; (4) domain preference. We then investigate whether existing inference optimization techniques remain effective for RLLMs. Our main takeaways are that model quantization and speculative decoding can improve serving-system efficiency with little compromise to RLLM accuracy, whereas prefix caching and KV cache quantization may even degrade accuracy or serving performance for small RLLMs. Lastly, we evaluate under real-world workloads modeled by a Gamma distribution to verify our findings; the empirical results across different datasets align with our main observations on RLLM serving. We hope this work provides the research community and industry with practical insights to advance RLLM inference serving.
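To make the workload-modeling step concrete, the sketch below shows one way to sample request arrival times from a Gamma distribution and replay prompts against a serving endpoint. This is a minimal illustration under stated assumptions, not the paper's actual harness: the shape/scale values are illustrative, and `send_request` is a hypothetical callback standing in for whatever client the serving system exposes.

```python
# Minimal sketch (not the paper's code): replay a request trace whose
# inter-arrival gaps follow a Gamma distribution. Shape/scale values and
# the send_request callback are placeholders chosen for illustration.
import time
import numpy as np


def gamma_arrival_times(num_requests: int, shape: float = 2.0,
                        scale: float = 0.5, seed: int = 0) -> np.ndarray:
    """Sample inter-arrival gaps (seconds) from a Gamma distribution and
    return cumulative arrival timestamps."""
    rng = np.random.default_rng(seed)
    gaps = rng.gamma(shape, scale, size=num_requests)
    return np.cumsum(gaps)


def replay_workload(prompts, send_request):
    """Issue each prompt at its Gamma-modeled arrival time."""
    arrivals = gamma_arrival_times(len(prompts))
    start = time.monotonic()
    for prompt, t in zip(prompts, arrivals):
        # Sleep until this request's scheduled arrival time, then submit it.
        delay = t - (time.monotonic() - start)
        if delay > 0:
            time.sleep(delay)
        send_request(prompt)
```

With this pattern, the same prompt set can be replayed at different request intensities simply by adjusting the Gamma shape and scale parameters.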