
MobileNMT: Enabling Translation in 15MB and 30ms

June 7, 2023
作者: Ye Lin, Xiaohui Wang, Zhexi Zhang, Mingxuan Wang, Tong Xiao, Jingbo Zhu
cs.AI

Abstract
Deploying NMT models on mobile devices is essential for privacy, low latency, and offline scenarios. To achieve high capacity, NMT models are rather large, and running them on devices is challenging given limited storage, memory, computation, and power. Existing work either focuses on a single metric such as FLOPs, or relies on general-purpose engines that are not well suited to auto-regressive decoding. In this paper, we present MobileNMT, a system that can translate in 15MB and 30ms on devices. We propose a series of principles for model compression when combined with quantization. Further, we implement an engine that is friendly to INT8 and decoding. With the co-design of model and engine, compared with the existing system, we achieve a 47.0x speedup and save 99.5% of memory with only an 11.6% loss of BLEU. The code is publicly available at https://github.com/zjersey/Lightseq-ARM.
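The abstract combines model compression with INT8 quantization. As a minimal illustration of the general idea (not the paper's exact scheme), symmetric per-tensor INT8 quantization stores weights as 8-bit integers plus a single float scale, cutting storage 4x relative to float32:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor INT8 quantization: map floats to [-127, 127]."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original float weights."""
    return q.astype(np.float32) * scale

# Example: a tiny weight tensor
w = np.array([0.5, -1.27, 0.02, 1.0], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# Rounding error per element is bounded by scale / 2
```

On-device engines such as the one described here run the integer matrix multiplications directly on `q` and fold `scale` into the output, which is what makes INT8 inference fast on mobile CPUs.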