Pangu Pro MoE: Mixture of Grouped Experts for Efficient Sparsity
May 27, 2025
Authors: Yehui Tang, Xiaosong Li, Fangcheng Liu, Wei Guo, Hang Zhou, Yaoyuan Wang, Kai Han, Xianzhi Yu, Jinpeng Li, Hui Zang, Fei Mi, Xiaojun Meng, Zhicheng Liu, Hanting Chen, Binfan Zheng, Can Chen, Youliang Yan, Ruiming Tang, Peifeng Qin, Xinghao Chen, Dacheng Tao, Yunhe Wang
cs.AI
Abstract
The rise of Mixture of Experts (MoE) in Large Language Models promises a
small execution cost for a much larger model parameter count and learning
capacity, because only a small fraction of the parameters is activated
for each input token. However, it is commonly observed that some experts are
activated far more often than others, leading to system inefficiency when
running the experts on different devices in parallel. Therefore, we introduce
Mixture of Grouped Experts (MoGE), which groups the experts during selection
and inherently balances the expert workload better than MoE does. It constrains
tokens to activate an equal number of experts within each predefined expert
group. When model execution is distributed across multiple devices, this
architectural design ensures a balanced computational load across devices,
significantly enhancing throughput, particularly for the inference phase.
Further, on Ascend NPUs we build Pangu Pro MoE, a sparse model based on MoGE
with 72 billion total parameters, 16 billion of which are activated for each
token. The configuration of Pangu Pro MoE is optimized for Ascend 300I Duo and
800I A2 through extensive system simulation studies. Our experiments indicate
that MoGE indeed leads to better expert load balancing and more efficient
execution for both model training and inference on Ascend NPUs. Pangu Pro MoE
achieves an inference throughput of 1148 tokens/s per card, which can be further
improved to 1528 tokens/s per card with speculative acceleration, outperforming
comparable 32B and 72B dense models. Furthermore, we achieve an excellent
cost-to-performance ratio for model inference on the Ascend 300I Duo. Our studies
show that Ascend NPUs are capable of training Pangu Pro MoE with massive
parallelization, making it a leading model within the sub-100B total parameter
class, outperforming prominent open-source models like GLM-Z1-32B and
Qwen3-32B.
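
As a rough illustration of the grouped routing described in the abstract, the sketch below shows a grouped top-k router: experts are partitioned into equally sized groups and every token selects the same number of experts from each group, so the load stays balanced when groups are mapped to different devices. This is a minimal sketch assuming a PyTorch setting; the function name, the softmax-based scoring, and the renormalization step are illustrative assumptions, not the paper's reference implementation.

```python
import torch


def grouped_topk_routing(router_logits, num_groups, topk_per_group):
    """Select an equal number of experts from every expert group per token.

    router_logits: (num_tokens, num_experts); num_experts must be divisible
    by num_groups. Returns routing weights of shape (num_tokens, num_experts)
    that are zero outside the selected experts.
    """
    num_tokens, num_experts = router_logits.shape
    group_size = num_experts // num_groups

    # Score all experts, then reshape so each group is ranked independently.
    scores = torch.softmax(router_logits, dim=-1)              # (tokens, experts)
    grouped = scores.view(num_tokens, num_groups, group_size)  # (tokens, groups, group_size)

    # Top-k within every group: each group contributes the same number of experts,
    # which is what keeps per-device load balanced when groups map to devices.
    topk_vals, topk_idx = torch.topk(grouped, topk_per_group, dim=-1)

    # Scatter the selected weights back into a dense (tokens, experts) tensor.
    weights = torch.zeros_like(grouped)
    weights.scatter_(-1, topk_idx, topk_vals)
    weights = weights.view(num_tokens, num_experts)

    # Renormalize over the selected experts (an assumed choice for this sketch).
    return weights / weights.sum(dim=-1, keepdim=True)


# Example: 64 experts in 8 groups, 1 expert per group -> 8 active experts per token.
logits = torch.randn(4, 64)
routing_weights = grouped_topk_routing(logits, num_groups=8, topk_per_group=1)
```

In contrast, a plain top-k router ranks all experts globally, so popular experts can be selected far more often than others; the per-group constraint above rules that imbalance out by construction.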