Mixtral of Experts

Abstract

We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, except that each layer is composed of 8 feedforward blocks (i.e., experts). For every token, at each layer, a router network selects two experts to process the current state and combines their outputs. Although each token only sees two experts, the selected experts can differ at each timestep. As a result, each token has access to 47B parameters but uses only 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens, and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, which surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.
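The routing step described in the abstract (8 experts per layer, two selected per token, outputs combined) can be illustrated with a small pure-Python sketch. This is not the released implementation. The abstract does not say how the two outputs are weighted, so the sketch assumes a softmax over the two selected router logits. The linear router, the toy "experts" that merely scale their input, and all function names are hypothetical stand-ins chosen for illustration.

```python
import math

NUM_EXPERTS = 8  # feedforward blocks per layer, as stated in the abstract
TOP_K = 2        # experts selected per token, as stated in the abstract


def top_k_route(router_logits, k=TOP_K):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Assumption: weights are a softmax over the selected logits only.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(router_logits[i] for i in chosen)  # subtract for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def moe_layer(token_state, router_weights, experts):
    """Route one token's hidden state and combine the selected experts' outputs.

    token_state:    list of floats (the token's current hidden state)
    router_weights: one weight vector per expert (hypothetical linear router)
    experts:        callables standing in for the feedforward blocks
    """
    logits = [sum(w * x for w, x in zip(row, token_state))
              for row in router_weights]
    routes = top_k_route(logits)
    output = [0.0] * len(token_state)
    for idx, weight in routes:
        expert_out = experts[idx](token_state)  # only the selected experts run
        output = [o + weight * v for o, v in zip(output, expert_out)]
    return output, routes


def make_toy_expert(scale):
    """Toy stand-in for a feedforward block: scales its input by a constant."""
    return lambda x: [scale * v for v in x]


if __name__ == "__main__":
    # Router whose logit for expert i is i * token_state[0] (illustrative only).
    router_weights = [[float(i), 0.0] for i in range(NUM_EXPERTS)]
    experts = [make_toy_expert(i + 1) for i in range(NUM_EXPERTS)]
    out, routes = moe_layer([1.0, 0.0], router_weights, experts)
    print("selected experts and weights:", routes)
    print("combined output:", out)
```

Only the two selected experts are evaluated for a given token, which is why the active parameter count per token is much smaller than the total parameter count even though every expert remains available to the router.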
