

Mixtral of Experts

January 8, 2024
Authors: Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed
cs.AI

Abstract

We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.
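
To make the routing described in the abstract concrete, here is a minimal PyTorch sketch of a top-2 sparse MoE feedforward layer: a linear gate scores the 8 experts per token, the top two are kept, their gate logits are renormalized with a softmax, and the two expert outputs are combined with those weights. This is an illustration under assumed shapes and module names (the class name, the plain MLP experts, and the dimensions are placeholders), not Mixtral's actual implementation, which uses SwiGLU experts and optimized expert-parallel kernels.

```python
# Illustrative top-2 sparse MoE layer (placeholder sizes; not Mixtral's real code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    def __init__(self, dim: int, hidden_dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router: a linear gate producing one score per expert for each token.
        self.gate = nn.Linear(dim, num_experts, bias=False)
        # Experts: independent feedforward blocks (plain MLPs here for simplicity).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.SiLU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, dim). Each token is routed to its top-2 experts.
        scores = self.gate(x)                                   # (tokens, num_experts)
        weights, idx = torch.topk(scores, self.top_k, dim=-1)   # (tokens, top_k)
        weights = F.softmax(weights, dim=-1)                    # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                           # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```

Because only two of the eight experts run per token, the parameters touched at inference (roughly 13B of the 47B total, per the abstract) stay far below the full model size while every expert remains reachable at every layer.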