Mixtral of Experts

January 8, 2024
Authors: Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed
cs.AI

Abstract

We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.
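
The routing scheme described in the abstract can be illustrated with a short PyTorch sketch. This is an assumption-based illustration, not the released Mixtral implementation: the class names (SparseMoELayer, SwiGLUExpert) and the toy dimensions are invented for clarity. Only the overall idea follows the abstract: each layer holds 8 feed-forward experts, a router picks the top 2 per token, and their outputs are combined with weights from a softmax over the selected gate logits.

```python
# Hypothetical sketch of a top-2 sparse Mixture-of-Experts feed-forward layer.
# Names and sizes are illustrative assumptions, not Mixtral's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUExpert(nn.Module):
    """One feed-forward 'expert' block (assumed SwiGLU-style MLP)."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))


class SparseMoELayer(nn.Module):
    """Router selects top_k of n_experts per token and mixes their outputs."""

    def __init__(self, dim: int, hidden_dim: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(
            SwiGLUExpert(dim, hidden_dim) for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, dim); this replaces the dense MLP of a transformer block.
        logits = self.router(x)                                   # (tokens, n_experts)
        weights, chosen = torch.topk(logits, self.top_k, dim=-1)  # top-2 experts per token
        weights = F.softmax(weights, dim=-1)                      # normalize over the 2 picks
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            token_idx, slot = torch.where(chosen == i)            # tokens routed to expert i
            if token_idx.numel() == 0:
                continue
            out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert(x[token_idx])
        return out


if __name__ == "__main__":
    layer = SparseMoELayer(dim=64, hidden_dim=256)  # toy sizes, not Mixtral's real dimensions
    tokens = torch.randn(10, 64)
    print(layer(tokens).shape)  # torch.Size([10, 64])
```

Because only the two selected experts run for each token, the parameter count that matters for inference cost is the active subset (13B in Mixtral's case) rather than the full 47B stored across all experts.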