
FLAME-MoE: A Transparent End-to-End Research Platform for Mixture-of-Experts Language Models

May 26, 2025
Authors: Hao Kang, Zichun Yu, Chenyan Xiong
cs.AI

Abstract

Recent large language models such as Gemini-1.5, DeepSeek-V3, and Llama-4 increasingly adopt Mixture-of-Experts (MoE) architectures, which offer strong efficiency-performance trade-offs by activating only a fraction of the model per token. Yet academic researchers still lack a fully open, end-to-end MoE platform for investigating scaling, routing, and expert behavior. We release FLAME-MoE, a completely open-source research suite composed of seven decoder-only models, ranging from 38M to 1.7B active parameters, whose architecture--64 experts with top-8 gating and 2 shared experts--closely reflects modern production LLMs. All training data pipelines, scripts, logs, and checkpoints are publicly available to enable reproducible experimentation. Across six evaluation tasks, FLAME-MoE improves average accuracy by up to 3.4 points over dense baselines trained with identical FLOPs. Leveraging full training trace transparency, we present initial analyses showing that (i) experts increasingly specialize on distinct token subsets, (ii) co-activation matrices remain sparse, reflecting diverse expert usage, and (iii) routing behavior stabilizes early in training. All code, training logs, and model checkpoints are available at https://github.com/cmu-flame/FLAME-MoE.
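To make the architecture described above concrete, here is a minimal PyTorch sketch of an MoE feed-forward layer with 64 routed experts, top-8 gating, and 2 always-on shared experts. This is an illustrative reconstruction, not the released FLAME-MoE code; the class name FlameMoELayer, the hidden sizes, and the renormalize-after-top-k gating variant are assumptions.

```python
# Illustrative sketch only (not the authors' implementation) of the layer shape
# the abstract describes: 64 routed experts, top-8 gating, 2 shared experts.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlameMoELayer(nn.Module):  # hypothetical name
    def __init__(self, d_model=512, d_ff=1024, n_experts=64, top_k=8, n_shared=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        # Shared experts process every token and bypass the router entirely.
        self.shared = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_shared)
        )

    def forward(self, x):                          # x: (tokens, d_model)
        gate_logits = self.router(x)               # (tokens, n_experts)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # renormalize over the top-8 only
        out = sum(e(x) for e in self.shared)       # shared experts: all tokens
        for slot in range(self.top_k):             # routed experts: top-8 per token
            for e_id in range(len(self.experts)):
                mask = idx[:, slot] == e_id
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e_id](x[mask])
        return out

# Usage: each token activates 2 shared + 8 of 64 routed experts.
layer = FlameMoELayer()
y = layer(torch.randn(10, 512))                    # 10 tokens in, same shape out
```

The per-expert loop trades speed for readability; production MoE kernels batch tokens by destination expert instead.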
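Analysis (ii) in the abstract concerns expert co-activation: how often two experts are selected for the same token, with sparse co-activation indicating diverse expert usage. A hedged sketch of how such a matrix can be computed from router outputs follows; the function name and tensor shapes are assumptions, not the paper's analysis code.

```python
# Illustrative sketch: count, for each expert pair (i, j), how many tokens
# route to both i and j under top-k gating.
import torch

def coactivation_matrix(topk_idx: torch.Tensor, n_experts: int = 64) -> torch.Tensor:
    """topk_idx: (tokens, top_k) long tensor of expert ids chosen per token."""
    onehot = torch.zeros(topk_idx.size(0), n_experts)
    onehot.scatter_(1, topk_idx, 1.0)   # (tokens, n_experts) 0/1 selection mask
    coact = onehot.T @ onehot           # entry (i, j) = #tokens selecting both
    coact.fill_diagonal_(0)             # drop self-pairs; keep cross-expert counts
    return coact
```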
