

FLAME-MoE: A Transparent End-to-End Research Platform for Mixture-of-Experts Language Models

May 26, 2025
Authors: Hao Kang, Zichun Yu, Chenyan Xiong
cs.AI

Abstract

Recent large language models such as Gemini-1.5, DeepSeek-V3, and Llama-4 increasingly adopt Mixture-of-Experts (MoE) architectures, which offer strong efficiency-performance trade-offs by activating only a fraction of the model per token. Yet academic researchers still lack a fully open, end-to-end MoE platform for investigating scaling, routing, and expert behavior. We release FLAME-MoE, a completely open-source research suite composed of seven decoder-only models, ranging from 38M to 1.7B active parameters, whose architecture--64 experts with top-8 gating and 2 shared experts--closely reflects modern production LLMs. All training data pipelines, scripts, logs, and checkpoints are publicly available to enable reproducible experimentation. Across six evaluation tasks, FLAME-MoE improves average accuracy by up to 3.4 points over dense baselines trained with identical FLOPs. Leveraging full training trace transparency, we present initial analyses showing that (i) experts increasingly specialize on distinct token subsets, (ii) co-activation matrices remain sparse, reflecting diverse expert usage, and (iii) routing behavior stabilizes early in training. All code, training logs, and model checkpoints are available at https://github.com/cmu-flame/FLAME-MoE.
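As a rough illustration of the layer type described in the abstract (64 routed experts, top-8 gating, and 2 shared experts), the following PyTorch sketch shows one possible MoE feed-forward block. It is not the FLAME-MoE training code; the class names, hidden sizes, and the choice to normalize gate weights with a softmax over the selected experts are assumptions made for clarity.

import torch
import torch.nn as nn
import torch.nn.functional as F


class FeedForward(nn.Module):
    # A simple two-layer FFN used for each expert (assumed architecture).
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_hidden)
        self.w_out = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_out(F.gelu(self.w_in(x)))


class TopKMoE(nn.Module):
    # Hypothetical MoE block: 64 routed experts, top-8 gating, 2 shared experts.
    def __init__(self, d_model=512, d_hidden=1024,
                 n_experts=64, top_k=8, n_shared=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            [FeedForward(d_model, d_hidden) for _ in range(n_experts)])
        self.shared = nn.ModuleList(
            [FeedForward(d_model, d_hidden) for _ in range(n_shared)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model) -- batch and sequence dims already flattened.
        logits = self.router(x)                          # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)   # top-8 experts per token
        weights = weights.softmax(dim=-1)                # normalize over selected experts

        out = torch.zeros_like(x)
        # Shared experts process every token (dense path).
        for expert in self.shared:
            out = out + expert(x)
        # Each routed expert processes only the tokens assigned to it.
        for e, expert in enumerate(self.experts):
            rows, slots = (idx == e).nonzero(as_tuple=True)
            if rows.numel() == 0:
                continue
            out[rows] += weights[rows, slots].unsqueeze(-1) * expert(x[rows])
        return out


if __name__ == "__main__":
    layer = TopKMoE()
    tokens = torch.randn(16, 512)   # 16 tokens, d_model = 512
    print(layer(tokens).shape)      # torch.Size([16, 512])

In this sketch the two shared experts run on every token, while each routed expert runs only on the tokens whose top-8 gate assignments include it; this per-token sparsity is what keeps the active parameter count a small fraction of the total, as the abstract describes.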
