MobileMoE: Scaling On-Device Mixture of Experts
May 26, 2026
Authors: Yanbei Chen, Hanxian Huang, Ernie Chang, Jacob Szwejbka, Digant Desai, Zechun Liu, Vikas Chandra, Raghuraman Krishnamoorthi
cs.AI
Abstract
Mixture-of-Experts (MoE) has become the de facto architecture for hundred-billion-parameter language models, yet its advantages at sub-billion scales for on-device deployment remain largely unexplored. To close this gap, we present MobileMoE, a family of on-device MoE language models with sub-billion active parameters (0.3-0.9B active and 1.3-5.3B total) that establish a new Pareto frontier for on-device LLMs. We first formulate an on-device MoE scaling law that jointly optimizes MoE architecture under mobile memory and compute constraints, identifying an on-device sweet spot (moderate sparsity with fine-grained and shared experts) that is simultaneously memory- and compute-optimal. Building on the derived architectures, we train MobileMoE with a four-stage recipe covering pre-training, mid-training, instruction fine-tuning, and quantization-aware training, all on open-source datasets. Across 14 benchmarks, MobileMoE matches or exceeds leading on-device dense LLMs with 2-4× fewer inference FLOPs, and matches or surpasses the state-of-the-art MoE OLMoE-1B-7B with up to 60% fewer parameters. To bridge the last mile to mobile deployment, we provide the first efficient MoE inference on commodity smartphones with comprehensive on-device profiling. At comparable INT4 weight memory, MobileMoE-S delivers 1.8-3.8× faster prefill and 2.2-3.4× faster decode than the dense baseline MobileLLM-Pro.
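
To make the memory-versus-compute trade-off in the abstract concrete, the following is a rough back-of-the-envelope sketch, not the authors' methodology. It estimates INT4 weight storage from the total parameter count (4 bits per weight, ignoring quantization scales and zero points) and the fraction of parameters active per token. The function names are hypothetical, and pairing the lower bound of the active range with the lower bound of the total range (and likewise for the upper bounds) is an assumption; the abstract gives only the two ranges.

```python
def int4_weight_gib(total_params: float, bits_per_weight: int = 4) -> float:
    """Approximate weight storage in GiB.

    Assumption: every weight is stored at `bits_per_weight` bits and the
    overhead of quantization scales / zero points is ignored.
    """
    return total_params * bits_per_weight / 8 / 2**30


def active_fraction(active_params: float, total_params: float) -> float:
    """Share of the model's parameters used for each token."""
    return active_params / total_params


# Range endpoints quoted in the abstract. Pairing min-with-min and
# max-with-max is an assumption made only for illustration.
family_endpoints = {
    "smallest": (0.3e9, 1.3e9),  # (active, total)
    "largest": (0.9e9, 5.3e9),
}

for name, (active, total) in family_endpoints.items():
    print(
        f"{name}: ~{int4_weight_gib(total):.2f} GiB INT4 weights, "
        f"{active_fraction(active, total):.0%} of parameters active per token"
    )
```

Under these assumptions, memory scales with total parameters while per-token compute scales with active parameters, which is why a sparse model can sit at a weight footprint comparable to a dense one yet need fewer inference FLOPs.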