MobileNMT: Enabling Translation in 15MB and 30ms

Abstract

Deploying neural machine translation (NMT) models on mobile devices is essential for privacy, low latency, and offline scenarios. To achieve high model capacity, NMT models are rather large. Running these models on devices is challenging given limited storage, memory, computation, and power. Existing work either focuses on only a single metric such as FLOPs, or is limited to general-purpose engines that are poorly suited to auto-regressive decoding. In this paper, we present MobileNMT, a system that can translate in 15MB and 30ms on devices. We propose a series of principles for model compression when combined with quantization. Further, we implement an engine that is friendly to INT8 and decoding. With the co-design of model and engine, compared with the existing system, we achieve a 47.0x speedup and save 99.5% of memory with only 11.6% loss of BLEU. The code is publicly available at https://github.com/zjersey/Lightseq-ARM.
