Reasoning Language Model Inference Serving Unveiled: An Empirical Study
October 21, 2025
Authors: Qi Li, Junpan Wu, Xiang Liu, Yuxin Wang, Zeyu Li, Zhenheng Tang, Yuhan Chen, Shaohuai Shi, Xiaowen Chu
cs.AI
Abstract
Reasoning large language models (RLLMs) have proven competitive with general LLMs in solving complex reasoning tasks such as mathematics and coding. However, the serving performance and behavior of RLLMs remain largely unexplored, which may hinder their deployment and utilization in real-world scenarios. To close this gap, we conduct a comprehensive study of RLLM serving. We first perform a pilot study comparing the serving performance of RLLMs and traditional LLMs and reveal several distinct differences in serving behavior: (1) significant memory usage and fluctuation; (2) straggler requests; (3) adaptive running time; (4) domain preference. We then investigate whether existing inference optimization techniques remain effective for RLLMs. Our main takeaways are that model quantization and speculative decoding can improve serving-system efficiency with only a small compromise in RLLM accuracy, while prefix caching and KV cache quantization may even degrade accuracy or serving performance for small RLLMs. Lastly, we conduct an evaluation under real-world workloads modeled by a Gamma distribution to verify our findings. Empirical results across different datasets are consistent with our main findings on RLLM serving. We hope our work provides the research community and industry with insights to advance RLLM inference serving.
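
The abstract states that real-world workloads are modeled by a Gamma distribution. Below is a minimal sketch, not taken from the paper, of how Gamma-distributed request arrivals could be generated for a serving benchmark; the function name, mean request rate, and coefficient of variation are illustrative assumptions only.

```python
import numpy as np

def gamma_arrival_times(num_requests: int,
                        mean_rate: float = 2.0,  # assumed average requests/second
                        cv: float = 2.0,         # assumed coefficient of variation (burstiness)
                        seed: int = 0) -> np.ndarray:
    """Return cumulative arrival timestamps (seconds) for num_requests requests.

    Inter-arrival times are drawn from a Gamma distribution whose shape is set
    from the desired coefficient of variation: shape < 1 (cv > 1) yields
    burstier traffic than a Poisson process, shape = 1 recovers exponential
    inter-arrivals.
    """
    rng = np.random.default_rng(seed)
    shape = 1.0 / (cv ** 2)            # Gamma shape k such that CV = 1/sqrt(k)
    scale = 1.0 / (mean_rate * shape)  # keep mean inter-arrival time = 1 / mean_rate
    inter_arrivals = rng.gamma(shape, scale, size=num_requests)
    return np.cumsum(inter_arrivals)

if __name__ == "__main__":
    # Timestamps at which requests would be issued to the serving engine.
    print(gamma_arrival_times(10))
```

In such a setup, the timestamps returned by this sketch would drive when each benchmark request is submitted to the serving engine, so that throughput and latency are measured under bursty rather than uniformly spaced traffic.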