Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs

March 29, 2024
Authors: Luchang Li, Sheng Qian, Jie Lu, Lunxi Yuan, Rui Wang, Qin Xie
cs.AI

Abstract

Large Language Models (LLMs) are widely employed on mobile phones for tasks such as intelligent assistants, text summarization, translation, and multi-modality. However, current methods for on-device LLM deployment suffer from slow inference speed, resulting in a poor user experience. To facilitate high-efficiency LLM deployment on device GPUs, we propose four optimization techniques: (a) a symbolic-expression-based approach to support dynamic shape model inference; (b) operator optimizations and execution priority setting to enhance inference speed and reduce phone lagging; (c) an FP4 quantization method termed M0E4 to reduce dequantization overhead; and (d) a sub-tensor-based technique to eliminate the need to copy the KV cache after LLM inference. Furthermore, we implement these methods in our mobile inference engine, Transformer-Lite, which is compatible with both Qualcomm and MTK processors. We evaluated Transformer-Lite's performance using LLMs with varied architectures and parameters ranging from 2B to 14B. Specifically, we achieved prefill and decoding speeds of 121 token/s and 14 token/s for ChatGLM2 6B, and 330 token/s and 30 token/s for the smaller Gemma 2B, respectively. Compared with CPU-based FastLLM and GPU-based MLC-LLM, our engine attains more than a 10x speedup in prefill speed and a 2-3x speedup in decoding speed.
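
The abstract only names technique (a); as a rough illustration of what symbolic-expression-based dynamic shapes can look like, here is a minimal Python sketch using sympy. The shape dictionary, dimension values, and the `concretize` helper are hypothetical illustrations, not the paper's implementation.

```python
import sympy as sp

# Symbolic sequence length: shapes are recorded once as expressions of it,
# so a single compiled graph can serve any prompt length at runtime.
seq_len = sp.Symbol("seq_len")
HIDDEN = 4096      # hypothetical model width
NUM_HEADS = 32     # hypothetical head count

# Shapes captured at graph-build time as symbolic expressions.
shapes = {
    "hidden_states": (1, seq_len, HIDDEN),
    "attn_scores": (NUM_HEADS, seq_len, seq_len),
}

def concretize(shape, **bindings):
    """Resolve a symbolic shape to concrete ints for one specific input."""
    return tuple(int(sp.sympify(dim).subs(bindings)) for dim in shape)

print(concretize(shapes["attn_scores"], seq_len=128))  # -> (32, 128, 128)
```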
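
Technique (c)'s M0E4 format is not specified in the abstract. For orientation only, the following NumPy sketch shows a generic symmetric 4-bit group quantize/dequantize pair and marks where per-weight dequantization overhead arises; the group size and all function names are illustrative, and this is not the paper's M0E4 scheme.

```python
import numpy as np

GROUP = 32  # hypothetical quantization group size

def quantize_4bit(w):
    """Generic symmetric 4-bit group quantization (illustrative only)."""
    w = w.reshape(-1, GROUP)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scale = np.maximum(scale, 1e-8)  # avoid divide-by-zero for all-zero groups
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize(q, scale):
    # The per-weight multiply below is the "dequantization overhead" that a
    # format like M0E4 is designed to shrink inside the matmul kernel.
    return (q.astype(np.float16) * scale).reshape(-1)

w = np.random.randn(4096).astype(np.float32)
q, s = quantize_4bit(w)
w_hat = dequantize(q, s)
assert w_hat.shape == w.shape
```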
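
Technique (d) avoids copying the KV cache by operating on sub-tensors of a preallocated buffer. The NumPy sketch below captures that general idea under assumed shapes and names (MAX_SEQ_LEN, attend, and so on are hypothetical, not from the paper): new keys and values are written in place, and attention reads a zero-copy slice of the cache.

```python
import numpy as np

MAX_SEQ_LEN = 2048  # hypothetical context limit
NUM_HEADS = 8       # hypothetical head count
HEAD_DIM = 64

# Preallocate the full KV buffer once. Each step writes its new keys/values
# directly into the next free rows, so nothing has to be copied afterwards
# to "append" them -- the cache already holds them in place.
k_cache = np.zeros((MAX_SEQ_LEN, NUM_HEADS, HEAD_DIM), dtype=np.float16)
v_cache = np.zeros((MAX_SEQ_LEN, NUM_HEADS, HEAD_DIM), dtype=np.float16)
cached_len = 0

def attend(q, k_new, v_new):
    """Store new KV in the preallocated buffer, then attend over a
    zero-copy sub-tensor view covering every cached token."""
    global cached_len
    n_new = k_new.shape[0]
    k_cache[cached_len:cached_len + n_new] = k_new
    v_cache[cached_len:cached_len + n_new] = v_new
    cached_len += n_new
    k = k_cache[:cached_len]  # slicing a leading range is a view, not a copy
    v = v_cache[:cached_len]
    scores = np.einsum("hd,thd->ht", q, k) / np.sqrt(HEAD_DIM)
    probs = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return np.einsum("ht,thd->hd", probs, v)

# Decode one step: a single new token's per-head q/k/v.
q = np.random.randn(NUM_HEADS, HEAD_DIM).astype(np.float16)
kv = np.random.randn(1, NUM_HEADS, HEAD_DIM).astype(np.float16)
out = attend(q, kv, kv.copy())
```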
