Grove MoE: Towards Efficient and Superior MoE LLMs with Adjugate Experts
August 11, 2025
Authors: Haoyuan Wu, Haoxing Chen, Xiaodong Chen, Zhanchao Zhou, Tieyuan Chen, Yihong Zhuang, Guoshan Lu, Zenan Huang, Junbo Zhao, Lin Liu, Zhenzhong Lan, Bei Yu, Jianguo Li
cs.AI
Abstract
The Mixture of Experts (MoE) architecture is a cornerstone of modern
state-of-the-art (SOTA) large language models (LLMs). MoE models facilitate
scalability by enabling sparse parameter activation. However, the traditional MoE
architecture uses homogeneous experts of a uniform size, activating a fixed
number of parameters irrespective of input complexity and thus limiting
computational efficiency. To overcome this limitation, we introduce Grove MoE,
a novel architecture incorporating experts of varying sizes, inspired by the
heterogeneous big.LITTLE CPU architecture. This architecture features novel
adjugate experts with a dynamic activation mechanism, enabling model capacity
expansion while maintaining manageable computational overhead. Building on this
architecture, we present GroveMoE-Base and GroveMoE-Inst, 33B-parameter LLMs
developed by applying an upcycling strategy to the Qwen3-30B-A3B-Base model
during mid-training and post-training. GroveMoE models dynamically activate
3.14-3.28B parameters based on token complexity and achieve performance
comparable to SOTA open-source models of similar or even larger size.
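
To make the architecture described above more concrete, the following is a minimal, illustrative PyTorch sketch of a Grove-style MoE layer. It assumes one reading of the abstract: routed experts are partitioned into groups, each group shares a smaller "adjugate" expert, and an adjugate expert runs (at most once per token) only when the router selects at least one expert from its group, so the number of activated parameters varies with routing. All class names, dimensions, and the unweighted combination of adjugate outputs below are hypothetical and are not taken from the paper.

```python
# Minimal sketch of a Grove-style MoE layer (assumption-laden illustration).
# Assumes "adjugate experts" are small group-shared experts that fire whenever
# any routed expert in their group fires; names, sizes, and the way adjugate
# outputs are combined are hypothetical, not the paper's specification.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FFNExpert(nn.Module):
    """A standard SwiGLU-style feed-forward expert."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden, bias=False)
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))


class GroveMoELayer(nn.Module):
    """Heterogeneous MoE: routed experts plus group-shared adjugate experts."""
    def __init__(self, d_model=64, n_experts=8, n_groups=4, top_k=2,
                 d_expert=128, d_adjugate=32):
        super().__init__()
        assert n_experts % n_groups == 0
        self.top_k, self.group_size = top_k, n_experts // n_groups
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            FFNExpert(d_model, d_expert) for _ in range(n_experts))
        # One small adjugate expert per group (the "LITTLE" cores in the
        # big.LITTLE analogy from the abstract).
        self.adjugates = nn.ModuleList(
            FFNExpert(d_model, d_adjugate) for _ in range(n_groups))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model); per-token loops are for clarity, not speed.
        weights, idx = self.router(x).softmax(-1).topk(self.top_k, dim=-1)
        weights = weights / weights.sum(-1, keepdim=True)
        out = torch.zeros_like(x)
        for t in range(x.size(0)):
            token = x[t]
            hit_groups = set()
            for k in range(self.top_k):
                e = idx[t, k].item()
                out[t] += weights[t, k] * self.experts[e](token)
                hit_groups.add(e // self.group_size)
            # Each adjugate expert runs at most once per token, and only if its
            # group received a routed expert -> per-token dynamic compute.
            for g in hit_groups:
                out[t] += self.adjugates[g](token)
        return out


if __name__ == "__main__":
    layer = GroveMoELayer()
    y = layer(torch.randn(5, 64))
    print(y.shape)  # torch.Size([5, 64])
```

Under these assumptions, a token whose top-k experts all fall in one group triggers a single adjugate expert, while a token whose experts span several groups triggers several, so activated parameters fluctuate within a narrow band per token. This is the kind of dynamic activation the abstract quantifies as 3.14-3.28B activated parameters for GroveMoE.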