MobileMoE: Scaling On-Device Mixture-of-Experts Models

Abstract

Mixture-of-Experts (MoE) has become the de facto architecture for hundred-billion-parameter language models, yet its advantages at sub-billion scales for on-device deployment remain largely unexplored. To close this gap, we present MobileMoE, a family of on-device MoE language models with sub-billion active parameters (0.3-0.9B active and 1.3-5.3B total) that establishes a new Pareto frontier for on-device LLMs. We first formulate an on-device MoE scaling law that jointly optimizes the MoE architecture under mobile memory and compute constraints, identifying an on-device sweet spot (moderate sparsity with fine-grained and shared experts) that is simultaneously memory-optimal and compute-optimal. Building on the derived architectures, we train MobileMoE with a four-stage recipe covering pre-training, mid-training, instruction fine-tuning, and quantization-aware training, all on open-source datasets. Across 14 benchmarks, MobileMoE matches or exceeds leading on-device dense LLMs with 2-4x fewer inference FLOPs, and matches or surpasses the state-of-the-art MoE model OLMoE-1B-7B with up to 60% fewer parameters. To bridge the last mile to mobile deployment, we provide the first efficient MoE inference on commodity smartphones, together with comprehensive on-device profiling. At comparable INT4 weight memory, MobileMoE-S delivers 1.8-3.8x faster prefill and 2.2-3.4x faster decode than the dense baseline MobileLLM-Pro.
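To make the memory and compute figures above concrete, the following is a minimal back-of-the-envelope sketch, not the authors' scaling law or profiling code. It assumes that INT4 weights cost 4 bits per parameter, ignoring quantization scales, zero-points, and any layers kept at higher precision. It also assumes that the endpoints of the two ranges quoted above pair up (0.3B active with 1.3B total, 0.9B active with 5.3B total), which the abstract does not state. The function names and the `family_range` dictionary are hypothetical.

```python
def int4_weight_bytes(total_params: float, bits_per_weight: float = 4.0) -> float:
    """Approximate weight storage in bytes.

    Assumption: every parameter is stored at `bits_per_weight` bits; real
    INT4 checkpoints also carry scales/zero-points and may keep some layers
    at higher precision, so this is a lower bound.
    """
    return total_params * bits_per_weight / 8.0


def active_fraction(active_params: float, total_params: float) -> float:
    """Share of parameters that participate in computing a single token."""
    return active_params / total_params


# Range endpoints quoted in the abstract. Pairing the smallest active count
# with the smallest total (and largest with largest) is an assumption.
family_range = {
    "smallest": {"active": 0.3e9, "total": 1.3e9},
    "largest": {"active": 0.9e9, "total": 5.3e9},
}

for name, cfg in family_range.items():
    weight_gb = int4_weight_bytes(cfg["total"]) / 1e9
    frac = active_fraction(cfg["active"], cfg["total"])
    print(f"{name}: ~{weight_gb:.2f} GB of INT4 weights, "
          f"{frac:.0%} of parameters active per token")
```

The two quantities track the two constraints named in the abstract: total parameters determine how much weight memory must fit on the device, while active parameters determine the per-token compute.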