MobileMoE: Scaling Up On-Device Mixture of Experts

Abstract

Mixture-of-Experts (MoE) has become the de facto architecture for hundred-billion-parameter language models, yet its advantages at sub-billion scales for on-device deployment remain largely unexplored. To close this gap, we present MobileMoE, a family of on-device MoE language models with sub-billion active parameters (0.3-0.9B active and 1.3-5.3B total) that establish a new Pareto frontier for on-device LLMs. We first formulate an on-device MoE scaling law that jointly optimizes the MoE architecture under mobile memory and compute constraints, and identify an on-device sweet spot (moderate sparsity with fine-grained and shared experts) that is simultaneously memory-optimal and compute-optimal. Building on the derived architectures, we train MobileMoE with a four-stage recipe covering pre-training, mid-training, instruction fine-tuning, and quantization-aware training, all on open-source datasets. Across 14 benchmarks, MobileMoE matches or exceeds leading on-device dense LLMs with 2-4x fewer inference FLOPs, and matches or surpasses the state-of-the-art MoE OLMoE-1B-7B with up to 60% fewer parameters. To bridge the last mile to mobile deployment, we provide the first efficient MoE inference on commodity smartphones with comprehensive on-device profiling. At comparable INT4 weight memory, MobileMoE-S delivers 1.8-3.8x faster prefill and 2.2-3.4x faster decode than the dense baseline MobileLLM-Pro.
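
The distinction between active and total parameters, and its link to sparsity and INT4 weight memory, can be illustrated with a small calculation. The following is a minimal sketch, not the MobileMoE architecture: the `MoEConfig` fields, the gated three-matrix FFN, and all numeric values are hypothetical, and attention, embedding, and router parameters are ignored.

```python
from dataclasses import dataclass


@dataclass
class MoEConfig:
    # All fields and values are illustrative assumptions, not taken from the paper.
    d_model: int    # model (residual stream) width
    d_expert: int   # hidden width of each fine-grained expert FFN
    n_routed: int   # routed experts per layer
    n_shared: int   # shared experts per layer, applied to every token
    top_k: int      # routed experts selected per token
    n_layers: int   # number of MoE layers


def expert_ffn_params(d_model: int, d_expert: int) -> int:
    # Assumes a gated FFN with up, gate, and down projections and no biases.
    return 3 * d_model * d_expert


def moe_param_counts(cfg: MoEConfig) -> tuple[int, int]:
    """Return (total, active) expert parameters, ignoring attention and embeddings."""
    per_expert = expert_ffn_params(cfg.d_model, cfg.d_expert)
    # Every expert must be stored in memory.
    total = cfg.n_layers * (cfg.n_routed + cfg.n_shared) * per_expert
    # Only the shared experts plus the top-k routed experts run for each token.
    active = cfg.n_layers * (cfg.top_k + cfg.n_shared) * per_expert
    return total, active


def int4_weight_bytes(n_params: int) -> int:
    # 4 bits per weight; quantization scales and zero-points are ignored here.
    return n_params // 2


cfg = MoEConfig(d_model=1024, d_expert=512, n_routed=16, n_shared=1, top_k=3, n_layers=12)
total, active = moe_param_counts(cfg)
print(f"total expert params:  {total:,}")
print(f"active expert params: {active:,}")
print(f"total/active ratio:   {total / active:.2f}")
print(f"INT4 weight memory:   {int4_weight_bytes(total) / 2**20:.1f} MiB")
```

In this toy setting, memory cost scales with the total count while per-token compute scales with the active count, which is why an on-device design has to trade the two against each other rather than maximize sparsity alone.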