MobileNMT: Enabling Translation in 15MB and 30ms

Abstract

Deploying neural machine translation (NMT) models on mobile devices is essential for privacy, low latency, and offline scenarios. To achieve high model capacity, NMT models are rather large. Running these models on devices with limited storage, memory, computation, and power budgets is challenging. Existing work either focuses on a single metric such as FLOPs or relies on general-purpose engines that are not well suited to auto-regressive decoding. In this paper, we present MobileNMT, a system that can translate in 15MB and 30ms on devices. We propose a series of principles for model compression when combined with quantization. Further, we implement an engine that is friendly to INT8 and decoding. With the co-design of model and engine, compared with the existing system, we achieve a 47.0x speedup and save 99.5% of memory with only an 11.6% loss of BLEU. The code is publicly available at https://github.com/zjersey/Lightseq-ARM.
