

Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs

March 29, 2024
Authors: Luchang Li, Sheng Qian, Jie Lu, Lunxi Yuan, Rui Wang, Qin Xie
cs.AI

Abstract

Large Language Models (LLMs) are widely employed for tasks such as intelligent assistants, text summarization, translation, and multi-modality on mobile phones. However, current methods for on-device LLM deployment suffer from slow inference speed, which causes a poor user experience. To facilitate high-efficiency LLM deployment on device GPUs, we propose four optimization techniques: (a) a symbolic expression-based approach to support dynamic shape model inference; (b) operator optimizations and execution priority setting to enhance inference speed and reduce phone lag; (c) an FP4 quantization method termed M0E4 to reduce dequantization overhead; and (d) a sub-tensor-based technique to eliminate the need for copying the KV cache after LLM inference. Furthermore, we implement these methods in our mobile inference engine, Transformer-Lite, which is compatible with both Qualcomm and MTK processors. We evaluated Transformer-Lite's performance using LLMs with varied architectures and parameters ranging from 2B to 14B. Specifically, we achieved prefill and decoding speeds of 121 token/s and 14 token/s for ChatGLM2 6B, and 330 token/s and 30 token/s for the smaller Gemma 2B, respectively. Compared with CPU-based FastLLM and GPU-based MLC-LLM, our engine attains over 10x speedup in prefill speed and a 2-3x speedup in decoding speed.
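Technique (d) is only named in the abstract. The sketch below illustrates the general idea behind a pre-allocated, sub-tensor-view KV cache, assuming a PyTorch-style tensor API: the cache buffer is allocated once at maximum sequence length, new keys/values are written into it in place, and attention reads sub-tensor views of the valid prefix, so no post-inference copy is required. All names here (KVCache, append, advance, view) are hypothetical and are not taken from Transformer-Lite.

```python
import torch

class KVCache:
    """Minimal sketch of a sub-tensor-based KV cache (hypothetical API,
    not the paper's actual implementation). The buffer is pre-allocated
    at max_seq_len, so each decoding step writes in place instead of
    concatenating into a newly allocated tensor and copying."""

    def __init__(self, num_layers, num_heads, head_dim, max_seq_len,
                 dtype=torch.float16):
        # One buffer per layer: [2 (K/V), heads, max_seq_len, head_dim].
        self.buffers = [
            torch.empty(2, num_heads, max_seq_len, head_dim, dtype=dtype)
            for _ in range(num_layers)
        ]
        self.seq_len = 0  # number of valid cached positions

    def append(self, layer, k_new, v_new):
        # k_new / v_new: [heads, new_tokens, head_dim].
        n = k_new.shape[1]
        buf = self.buffers[layer]
        # Write into the pre-allocated region: no reallocation, no copy-back.
        buf[0, :, self.seq_len:self.seq_len + n] = k_new
        buf[1, :, self.seq_len:self.seq_len + n] = v_new

    def advance(self, n):
        # Call once per step, after all layers have appended n tokens.
        self.seq_len += n

    def view(self, layer):
        # Sub-tensor views over the valid prefix; they alias the buffer,
        # so attention can consume them directly without a copy.
        buf = self.buffers[layer]
        return buf[0, :, :self.seq_len], buf[1, :, :self.seq_len]

# Usage sketch: decode one token per step without copying the cache.
cache = KVCache(num_layers=2, num_heads=8, head_dim=64, max_seq_len=2048)
k = torch.randn(8, 1, 64, dtype=torch.float16)
v = torch.randn(8, 1, 64, dtype=torch.float16)
for layer in range(2):
    cache.append(layer, k, v)
cache.advance(1)
k_all, v_all = cache.view(0)  # views over the valid prefix: no copy
```

Compared with the common pattern of torch.cat([cache, k_new]) at every step, which reallocates and copies the whole cache, the views returned here alias the same memory, keeping the per-step cache cost constant.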
