MobileMoE: Scaling On-Device Mixture-of-Experts Models

Abstract

Mixture-of-Experts (MoE) has become the de facto architecture for hundred-billion-parameter language models, yet its advantages at sub-billion scales for on-device deployment remain largely unexplored. To close this gap, we present MobileMoE, a family of on-device MoE language models with sub-billion active parameters (0.3-0.9B active and 1.3-5.3B total) that establishes a new Pareto frontier for on-device LLMs. We first formulate an on-device MoE scaling law that jointly optimizes the MoE architecture under mobile memory and compute constraints, identifying an on-device sweet spot (moderate sparsity with fine-grained and shared experts) that is simultaneously memory- and compute-optimal. Building on the derived architectures, we train MobileMoE with a four-stage recipe covering pre-training, mid-training, instruction fine-tuning, and quantization-aware training, all on open-source datasets. Across 14 benchmarks, MobileMoE matches or exceeds leading on-device dense LLMs while using 2-4x fewer inference FLOPs, and matches or surpasses the state-of-the-art MoE OLMoE-1B-7B with up to 60% fewer parameters. To bridge the last mile to mobile deployment, we provide the first efficient MoE inference on commodity smartphones, together with comprehensive on-device profiling. At comparable INT4 weight memory, MobileMoE-S delivers 1.8-3.8x faster prefill and 2.2-3.4x faster decode than the dense baseline MobileLLM-Pro.
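
The distinction between active and total parameters can be made concrete with a small sketch. The Python snippet below counts feed-forward parameters for one MoE layer with routed and shared experts. It is only an illustration: the class `MoEFFNConfig`, the helper functions, the gated three-matrix expert layout, and every numeric value are assumptions chosen for the example, not the MobileMoE configuration or the authors' scaling law.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEFFNConfig:
    """Hypothetical per-layer MoE feed-forward configuration (illustrative only)."""
    d_model: int   # hidden size of the transformer
    d_expert: int  # intermediate size of a single (fine-grained) expert
    n_routed: int  # routed experts stored in the layer
    n_shared: int  # shared experts, executed for every token
    top_k: int     # routed experts selected per token


def expert_params(cfg: MoEFFNConfig) -> int:
    # Assumption: gated FFN with gate, up, and down projections and no biases.
    return 3 * cfg.d_model * cfg.d_expert


def router_params(cfg: MoEFFNConfig) -> int:
    # Assumption: a single linear router producing one logit per routed expert.
    return cfg.d_model * cfg.n_routed


def total_ffn_params(cfg: MoEFFNConfig) -> int:
    # Everything that must be stored in memory: all experts plus the router.
    return (cfg.n_routed + cfg.n_shared) * expert_params(cfg) + router_params(cfg)


def active_ffn_params(cfg: MoEFFNConfig) -> int:
    # Weights touched per token: top-k routed experts, shared experts, router.
    return (cfg.top_k + cfg.n_shared) * expert_params(cfg) + router_params(cfg)


def int4_weight_bytes(n_params: int) -> int:
    # 4 bits per weight; quantization scales and zero points are ignored here.
    return (n_params + 1) // 2


# Made-up example values, not taken from the paper.
cfg = MoEFFNConfig(d_model=1024, d_expert=256, n_routed=32, n_shared=1, top_k=4)
total = total_ffn_params(cfg)
active = active_ffn_params(cfg)
print(f"total FFN params per layer:  {total:,}")
print(f"active FFN params per layer: {active:,}")
print(f"total/active ratio:          {total / active:.2f}")
print(f"INT4 weight bytes per layer: {int4_weight_bytes(total):,}")
```

In this toy setting, memory footprint follows the total count while per-token compute follows the active count, which is the tension the abstract describes when it seeks a configuration that is optimal on both axes.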