OLMoE: Open Mixture-of-Experts Language Models

September 3, 2024
Authors: Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Sewon Min, Weijia Shi, Pete Walsh, Oyvind Tafjord, Nathan Lambert, Yuling Gu, Shane Arora, Akshita Bhagia, Dustin Schwenk, David Wadden, Alexander Wettig, Binyuan Hui, Tim Dettmers, Douwe Kiela, Ali Farhadi, Noah A. Smith, Pang Wei Koh, Amanpreet Singh, Hannaneh Hajishirzi
cs.AI

Abstract

We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-Instruct. Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B. We present various experiments on MoE training, analyze routing in our model showing high specialization, and open-source all aspects of our work: model weights, training data, code, and logs.
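To make the "sparse" aspect concrete: in an MoE layer, a learned router sends each token through only a few of the available expert feed-forward networks, so the parameters active per token (about 1B for OLMoE-1B-7B) are a small fraction of the total (about 7B). The sketch below is a minimal, illustrative top-k token-choice MoE layer in PyTorch; the expert count, hidden sizes, and k are placeholder values, not OLMoE's actual configuration.

```python
# Minimal sketch of a sparse Mixture-of-Experts feed-forward layer with
# top-k token-choice routing. All sizes here are illustrative placeholders,
# not OLMoE's real hyperparameters.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=1024, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Router produces one score per expert for each token.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Each expert is an independent small feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (num_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)              # routing probabilities
        weights, indices = probs.topk(self.top_k, dim=-1)      # keep only k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize the chosen weights
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # Tokens for which expert e is among the top-k choices.
            token_idx, slot = (indices == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue  # this expert receives no tokens in this batch
            out[token_idx] += weights[token_idx, slot, None] * expert(x[token_idx])
        return out

# Only top_k experts run per token, so the compute and active parameters per
# token are a fraction of the layer's total parameter count.
layer = MoELayer()
tokens = torch.randn(4, 512)
print(layer(tokens).shape)  # torch.Size([4, 512])
```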
