MobileMoE: Scaling On-Device Mixture-of-Experts Models

Abstract

Mixture-of-Experts (MoE) has become the de facto architecture for hundred-billion-parameter language models, yet its advantages at sub-billion scales for on-device deployment remain largely unexplored. To close this gap, we present MobileMoE, a family of on-device MoE language models with sub-billion active parameters (0.3-0.9B active, 1.3-5.3B total) that establishes a new Pareto frontier for on-device LLMs. We first formulate an on-device MoE scaling law that jointly optimizes the MoE architecture under mobile memory and compute constraints. It identifies an on-device sweet spot (moderate sparsity with fine-grained and shared experts) that is simultaneously memory- and compute-optimal. Building on the derived architectures, we train MobileMoE with a four-stage recipe covering pre-training, mid-training, instruction fine-tuning, and quantization-aware training, all on open-source datasets. Across 14 benchmarks, MobileMoE matches or exceeds leading on-device dense LLMs with 2-4× fewer inference FLOPs, and matches or surpasses the state-of-the-art MoE model OLMoE-1B-7B with up to 60% fewer parameters. To bridge the last mile to mobile deployment, we provide the first efficient MoE inference on commodity smartphones, together with comprehensive on-device profiling. At comparable INT4 weight memory, MobileMoE-S delivers 1.8-3.8× faster prefill and 2.2-3.4× faster decode than the dense baseline MobileLLM-Pro.
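
The gap between active and total parameters, and its effect on INT4 weight memory, can be illustrated with a small back-of-the-envelope sketch. Everything in it is an assumption made for illustration rather than the MobileMoE configuration: the `MoEFFNConfig` fields, the example sizes, the gated three-matrix expert layout, and the linear router are all hypothetical, and attention and embedding parameters are ignored.

```python
from dataclasses import dataclass


@dataclass
class MoEFFNConfig:
    """Hypothetical MoE feed-forward configuration (not taken from the paper)."""
    d_model: int    # model hidden size
    d_expert: int   # hidden width of one fine-grained expert
    n_routed: int   # routed experts per layer
    n_shared: int   # shared experts that process every token
    top_k: int      # routed experts selected per token
    n_layers: int   # number of MoE layers


def expert_params(cfg: MoEFFNConfig) -> int:
    # Assumes a gated FFN expert with three d_model x d_expert matrices.
    return 3 * cfg.d_model * cfg.d_expert


def ffn_param_counts(cfg: MoEFFNConfig) -> tuple[int, int]:
    """Return (total, active-per-token) FFN parameters across all layers."""
    per_expert = expert_params(cfg)
    router = cfg.d_model * cfg.n_routed  # assumed linear router, no bias
    total = cfg.n_layers * ((cfg.n_routed + cfg.n_shared) * per_expert + router)
    active = cfg.n_layers * ((cfg.top_k + cfg.n_shared) * per_expert + router)
    return total, active


def int4_weight_bytes(n_params: int) -> int:
    # 4 bits per weight; ignores quantization scales and zero-points.
    return (n_params + 1) // 2


# Illustrative sizes only.
cfg = MoEFFNConfig(d_model=1024, d_expert=256, n_routed=32,
                   n_shared=1, top_k=4, n_layers=16)
total, active = ffn_param_counts(cfg)
print(f"total FFN params:  {total:,}")
print(f"active FFN params: {active:,}")
print(f"INT4 weight bytes: {int4_weight_bytes(total):,}")
```

In this sketch, weight memory scales with the total parameter count because every expert must be stored, while per-token compute scales with the active count. That is the trade-off a joint memory and compute constraint has to balance.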