

MobileNMT: Enabling Translation in 15MB and 30ms

June 7, 2023
Authors: Ye Lin, Xiaohui Wang, Zhexi Zhang, Mingxuan Wang, Tong Xiao, Jingbo Zhu
cs.AI

Abstract

Deploying NMT models on mobile devices is essential for privacy, low latency, and offline scenarios. Because of their high capacity, NMT models are rather large, and running them on devices with limited storage, memory, computation, and power is challenging. Existing work either focuses on a single metric such as FLOPs or relies on general-purpose engines that are not well suited to auto-regressive decoding. In this paper, we present MobileNMT, a system that can translate in 15MB and 30ms on devices. We propose a series of principles for model compression combined with quantization, and we implement an engine that is friendly to INT8 computation and decoding. With this co-design of model and engine, compared with the existing system we achieve a 47.0x speedup and save 99.5% of memory, with only an 11.6% loss of BLEU. The code is publicly available at https://github.com/zjersey/Lightseq-ARM.
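
To make the storage arithmetic behind INT8 compression concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization, the general technique the abstract refers to. The function names, the per-tensor granularity, and the use of NumPy are illustrative assumptions, not the paper's actual scheme, which combines quantization with further compression principles.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor INT8 quantization of an FP32 weight matrix.

    Returns the INT8 weights and the FP32 scale needed to dequantize.
    (Illustrative sketch; not the scheme used in MobileNMT.)
    """
    scale = np.abs(w).max() / 127.0  # map the largest magnitude to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an FP32 approximation of the original weights."""
    return q.astype(np.float32) * scale

# A 512x512 FP32 layer takes roughly 1 MB; its INT8 version takes a quarter of that.
w = np.random.randn(512, 512).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
print(f"size: {w.nbytes / 1e6:.2f} MB -> {q.nbytes / 1e6:.2f} MB")
print(f"mean abs error: {np.abs(w - w_hat).mean():.5f}")
```

Quantization alone gives at most a 4x reduction over FP32, which is why fitting a full translation model in 15MB also requires the architectural compression principles the paper proposes.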