
OLMoE: Open Mixture-of-Experts Language Models

September 3, 2024
Authors: Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Sewon Min, Weijia Shi, Pete Walsh, Oyvind Tafjord, Nathan Lambert, Yuling Gu, Shane Arora, Akshita Bhagia, Dustin Schwenk, David Wadden, Alexander Wettig, Binyuan Hui, Tim Dettmers, Douwe Kiela, Ali Farhadi, Noah A. Smith, Pang Wei Koh, Amanpreet Singh, Hannaneh Hajishirzi
cs.AI

Abstract

We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-Instruct. Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B. We present various experiments on MoE training, analyze routing in our model showing high specialization, and open-source all aspects of our work: model weights, training data, code, and logs.
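For readers unfamiliar with why a 7B-parameter model can use only about 1B parameters per token, the sketch below shows a generic sparse MoE feed-forward layer with top-k routing: a router scores all experts for each token, and only the top-k experts are evaluated. This is a minimal illustration, not the OLMoE implementation; the dimensions, expert count, and top-k value are placeholder defaults chosen for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoELayer(nn.Module):
    """Minimal sparse Mixture-of-Experts feed-forward layer (illustrative only).

    d_model, d_ff, n_experts, and top_k below are placeholder values,
    not the OLMoE-1B-7B configuration.
    """

    def __init__(self, d_model=1024, d_ff=2048, n_experts=64, top_k=8):
        super().__init__()
        self.top_k = top_k
        # Router produces one logit per expert for each token.
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (n_tokens, d_model)
        weights = F.softmax(self.router(x), dim=-1)          # (n_tokens, n_experts)
        topk_w, topk_idx = weights.topk(self.top_k, dim=-1)  # keep only top-k experts
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)   # renormalize routing weights
        out = torch.zeros_like(x)
        # Each token is processed only by its top-k experts, so only a
        # fraction of the layer's total parameters is active per token.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += topk_w[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


# Usage sketch: route a small batch of token representations.
layer = SparseMoELayer()
tokens = torch.randn(16, 1024)
print(layer(tokens).shape)  # torch.Size([16, 1024])
```

The paper's routing analysis (reported as showing high expert specialization) concerns which experts the router selects for which tokens; the loop above makes that selection explicit, whereas efficient implementations typically batch tokens per expert instead.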
