EMO: Pretraining Mixture of Experts for Emergent Modularity
May 7, 2026
Authors: Ryan Wang, Akshita Bhagia, Sewon Min
cs.AI
Abstract
Large language models are typically deployed as monolithic systems, requiring the full model even when applications need only a narrow subset of capabilities, e.g., code, math, or domain-specific knowledge. Mixture-of-Experts models (MoEs) seemingly offer an alternative by activating only a subset of experts per input, but in practice, restricting inference to a subset of experts for a given domain leads to severe performance degradation. This limits their practicality in memory-constrained settings, especially as models grow larger and sparser. We introduce EMO, an MoE designed for modularity (the independent use and composition of expert subsets) without requiring human-defined priors.

Our key idea is to encourage tokens from similar domains to rely on similar experts. Since tokens within a document often share a domain, EMO restricts them to select experts from a shared pool, while allowing different documents to use different pools. This simple constraint enables coherent expert groupings to emerge during pretraining using document boundaries alone. We pretrain a 1B-active, 14B-total EMO on 1T tokens. As a full model, it matches standard MoE performance. Crucially, it enables selective expert use: retaining only 25% (12.5%) of the experts incurs just a 1% (3%) absolute drop, whereas standard MoEs break under the same setting.

We further find that expert subsets in EMO specialize at the semantic level (e.g., domains such as math or code), in contrast to the low-level syntactic specialization observed in standard MoEs. Altogether, our results demonstrate a path toward modular, memory-efficient deployment of large, sparse models and open new opportunities for composable architectures.
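To make the document-level routing constraint concrete, here is a minimal, hypothetical PyTorch sketch. The abstract does not say how the shared expert pool is chosen, so this sketch assumes the pool is formed by averaging router logits over a document's tokens and keeping the top `pool_size` experts, followed by ordinary top-k routing restricted to that pool; `document_constrained_routing`, `pool_size`, and `top_k` are illustrative names, not the paper's API.

```python
import torch
import torch.nn.functional as F


def document_constrained_routing(router_logits, pool_size=16, top_k=2):
    """Route the tokens of ONE document to experts drawn from a shared pool.

    router_logits: [num_tokens, num_experts] tensor of router scores.
    Returns (expert_indices, expert_weights), each of shape [num_tokens, top_k].
    """
    # 1) Choose a shared expert pool for the whole document (assumption:
    #    the experts with the highest average router score over its tokens).
    doc_scores = router_logits.mean(dim=0)              # [num_experts]
    pool = doc_scores.topk(pool_size).indices           # [pool_size]

    # 2) Mask experts outside the pool so every token in this document
    #    can only select experts from the shared pool.
    mask = torch.full_like(router_logits, float("-inf"))
    mask[:, pool] = 0.0
    constrained_logits = router_logits + mask

    # 3) Standard top-k routing, now restricted to the pool.
    weights, indices = constrained_logits.topk(top_k, dim=-1)
    weights = F.softmax(weights, dim=-1)
    return indices, weights


# Toy usage: one document of 128 tokens, 64 experts in total.
logits = torch.randn(128, 64)
idx, w = document_constrained_routing(logits)
print(idx.shape, w.shape)  # torch.Size([128, 2]) torch.Size([128, 2])
```

Under this reading, selective expert use at deployment would amount to loading only the experts that a target domain's documents actually route to (e.g., 25% or 12.5% of all experts), the setting in which the abstract reports only a 1% and 3% absolute drop, respectively.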