

Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs

March 29, 2024
Authors: Luchang Li, Sheng Qian, Jie Lu, Lunxi Yuan, Rui Wang, Qin Xie
cs.AI

Abstract

Large Language Models (LLMs) are widely employed for tasks such as intelligent assistants, text summarization, translation, and multi-modality on mobile phones. However, current methods for on-device LLM deployment suffer from slow inference speed, which causes a poor user experience. To facilitate high-efficiency LLM deployment on device GPUs, we propose four optimization techniques: (a) a symbolic expression-based approach to support dynamic shape model inference; (b) operator optimizations and execution priority setting to enhance inference speed and reduce phone lag; (c) an FP4 quantization method termed M0E4 to reduce dequantization overhead; and (d) a sub-tensor-based technique to eliminate the need for copying the KV cache after LLM inference. Furthermore, we implement these methods in our mobile inference engine, Transformer-Lite, which is compatible with both Qualcomm and MTK processors. We evaluated Transformer-Lite's performance using LLMs with varied architectures and parameters ranging from 2B to 14B. Specifically, we achieved prefill and decoding speeds of 121 token/s and 14 token/s for ChatGLM2 6B, and 330 token/s and 30 token/s for the smaller Gemma 2B, respectively. Compared with CPU-based FastLLM and GPU-based MLC-LLM, our engine attains over 10x speedup in prefill speed and a 2-3x speedup in decoding speed.
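Technique (d) is only named in the abstract. The sketch below illustrates the general idea behind a pre-allocated, sub-tensor-view KV cache, assuming a PyTorch-style tensor API: the cache buffer is allocated once at maximum sequence length, new keys/values are written into it in place, and attention reads sub-tensor views of the valid prefix, so no post-inference copy is required. All names here (KVCache, append, advance, view) are hypothetical and are not taken from Transformer-Lite.

```python
import torch

class KVCache:
    """Minimal sketch of a sub-tensor-based KV cache (hypothetical API,
    not the paper's actual implementation). The buffer is pre-allocated
    at max_seq_len, so each decoding step writes in place instead of
    concatenating into a newly allocated tensor and copying."""

    def __init__(self, num_layers, num_heads, head_dim, max_seq_len,
                 dtype=torch.float16):
        # One buffer per layer: [2 (K/V), heads, max_seq_len, head_dim].
        self.buffers = [
            torch.empty(2, num_heads, max_seq_len, head_dim, dtype=dtype)
            for _ in range(num_layers)
        ]
        self.seq_len = 0  # number of valid cached positions

    def append(self, layer, k_new, v_new):
        # k_new / v_new: [heads, new_tokens, head_dim].
        n = k_new.shape[1]
        buf = self.buffers[layer]
        # Write into the pre-allocated region: no reallocation, no copy-back.
        buf[0, :, self.seq_len:self.seq_len + n] = k_new
        buf[1, :, self.seq_len:self.seq_len + n] = v_new

    def advance(self, n):
        # Call once per step, after all layers have appended n tokens.
        self.seq_len += n

    def view(self, layer):
        # Sub-tensor views over the valid prefix; they alias the buffer,
        # so attention can consume them directly without a copy.
        buf = self.buffers[layer]
        return buf[0, :, :self.seq_len], buf[1, :, :self.seq_len]

# Usage sketch: decode one token per step without copying the cache.
cache = KVCache(num_layers=2, num_heads=8, head_dim=64, max_seq_len=2048)
k = torch.randn(8, 1, 64, dtype=torch.float16)
v = torch.randn(8, 1, 64, dtype=torch.float16)
for layer in range(2):
    cache.append(layer, k, v)
cache.advance(1)
k_all, v_all = cache.view(0)  # views over the valid prefix: no copy
```

Compared with the common pattern of torch.cat([cache, k_new]) at every step, which reallocates and copies the whole cache, the views returned here alias the same memory, keeping the per-step cache cost constant.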
