

Pangu Pro MoE: Mixture of Grouped Experts for Efficient Sparsity

May 27, 2025
作者: Yehui Tang, Xiaosong Li, Fangcheng Liu, Wei Guo, Hang Zhou, Yaoyuan Wang, Kai Han, Xianzhi Yu, Jinpeng Li, Hui Zang, Fei Mi, Xiaojun Meng, Zhicheng Liu, Hanting Chen, Binfan Zheng, Can Chen, Youliang Yan, Ruiming Tang, Peifeng Qin, Xinghao Chen, Dacheng Tao, Yunhe Wang
cs.AI

Abstract

The surge of Mixture of Experts (MoE) in Large Language Models promises a small execution cost for a much larger model parameter count and learning capacity, because only a small fraction of parameters are activated for each input token. However, it is commonly observed that some experts are activated far more often than others, leading to system inefficiency when running the experts on different devices in parallel. Therefore, we introduce Mixture of Grouped Experts (MoGE), which groups the experts during selection and inherently balances the expert workload better than MoE. It constrains tokens to activate an equal number of experts within each predefined expert group. When model execution is distributed across multiple devices, this architectural design ensures a balanced computational load across devices, significantly enhancing throughput, particularly in the inference phase. Further, we build Pangu Pro MoE on Ascend NPUs: a MoGE-based sparse model with 72 billion total parameters, 16 billion of which are activated for each token. The configuration of Pangu Pro MoE is optimized for the Ascend 300I Duo and 800I A2 through extensive system simulation studies. Our experiments indicate that MoGE indeed leads to better expert load balancing and more efficient execution for both model training and inference on Ascend NPUs. The inference performance of Pangu Pro MoE reaches 1148 tokens/s per card and can be further improved to 1528 tokens/s per card with speculative acceleration, outperforming comparable 32B and 72B dense models. Furthermore, we achieve an excellent cost-to-performance ratio for model inference on the Ascend 300I Duo. Our studies show that Ascend NPUs are capable of training Pangu Pro MoE with massive parallelization, making it a leading model within the sub-100B total-parameter class and outperforming prominent open-source models such as GLM-Z1-32B and Qwen3-32B.
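To make the grouped routing idea concrete, below is a minimal sketch of grouped top-k expert selection in PyTorch. It is not the authors' implementation: the function name moge_router, the tensor shapes, the example sizes, and the per-group softmax normalization are illustrative assumptions. What it demonstrates is the property the abstract describes, namely that every token activates exactly the same number of experts inside each predefined expert group, so the computational load stays balanced when groups are mapped to different devices.

```python
# Minimal sketch of grouped top-k routing in the spirit of MoGE.
# Assumptions (not from the paper): function/variable names, the per-group
# softmax normalization, and the example sizes used at the bottom.
import torch


def moge_router(logits: torch.Tensor, num_groups: int, k_per_group: int) -> torch.Tensor:
    """logits: [num_tokens, num_experts] router scores.

    Returns a sparse gating matrix of the same shape in which every token
    has exactly `k_per_group` non-zero weights inside each expert group.
    """
    num_tokens, num_experts = logits.shape
    assert num_experts % num_groups == 0, "experts must split evenly into groups"
    group_size = num_experts // num_groups

    # View the scores per group: [num_tokens, num_groups, group_size].
    grouped = logits.view(num_tokens, num_groups, group_size)

    # Independently for every token, take the top-k experts within each group,
    # which enforces an equal number of activated experts per group.
    topk_vals, topk_idx = grouped.topk(k_per_group, dim=-1)

    # Scatter the selected weights (softmax-normalized within each group,
    # an assumed normalization choice) back into a dense gating matrix.
    gates = torch.zeros_like(grouped)
    gates.scatter_(-1, topk_idx, torch.softmax(topk_vals, dim=-1))
    return gates.view(num_tokens, num_experts)


# Illustrative usage: 64 experts in 8 groups, 1 expert activated per group,
# i.e. 8 activated experts per token. These sizes are for demonstration only.
gates = moge_router(torch.randn(4, 64), num_groups=8, k_per_group=1)
```

Because each group contributes the same number of activated experts for every token, assigning one group per device yields identical per-device expert counts by construction, which is the load-balancing property the abstract attributes to MoGE.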