Grove MoE: Towards Efficient and Superior MoE LLMs with Adjugate Experts
August 11, 2025
Authors: Haoyuan Wu, Haoxing Chen, Xiaodong Chen, Zhanchao Zhou, Tieyuan Chen, Yihong Zhuang, Guoshan Lu, Zenan Huang, Junbo Zhao, Lin Liu, Zhenzhong Lan, Bei Yu, Jianguo Li
cs.AI
Abstract
The Mixture of Experts (MoE) architecture is a cornerstone of modern
state-of-the-art (SOTA) large language models (LLMs). MoE models facilitate
scalability by enabling sparse parameter activation. However, the traditional MoE
architecture uses homogeneous experts of a uniform size, activating a fixed
number of parameters irrespective of input complexity and thus limiting
computational efficiency. To overcome this limitation, we introduce Grove MoE,
a novel architecture incorporating experts of varying sizes, inspired by the
heterogeneous big.LITTLE CPU architecture. This architecture features novel
adjugate experts with a dynamic activation mechanism, enabling model capacity
expansion while maintaining manageable computational overhead. Building on this
architecture, we present GroveMoE-Base and GroveMoE-Inst, 33B-parameter LLMs
developed by applying an upcycling strategy to the Qwen3-30B-A3B-Base model
during mid-training and post-training. GroveMoE models dynamically activate
3.14-3.28B parameters based on token complexity and achieve performance
comparable to SOTA open-source models of similar or even larger size.
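
To make the architecture described above more concrete, the following is a minimal, illustrative PyTorch sketch of a Grove-style MoE layer. It assumes one reading of the abstract: routed experts are partitioned into groups, each group shares a smaller "adjugate" expert, and an adjugate expert runs (at most once per token) only when the router selects at least one expert from its group, so the number of activated parameters varies with routing. All class names, dimensions, and the unweighted combination of adjugate outputs below are hypothetical and are not taken from the paper.

```python
# Minimal sketch of a Grove-style MoE layer (assumption-laden illustration).
# Assumes "adjugate experts" are small group-shared experts that fire whenever
# any routed expert in their group fires; names, sizes, and the way adjugate
# outputs are combined are hypothetical, not the paper's specification.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FFNExpert(nn.Module):
    """A standard SwiGLU-style feed-forward expert."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden, bias=False)
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))


class GroveMoELayer(nn.Module):
    """Heterogeneous MoE: routed experts plus group-shared adjugate experts."""
    def __init__(self, d_model=64, n_experts=8, n_groups=4, top_k=2,
                 d_expert=128, d_adjugate=32):
        super().__init__()
        assert n_experts % n_groups == 0
        self.top_k, self.group_size = top_k, n_experts // n_groups
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            FFNExpert(d_model, d_expert) for _ in range(n_experts))
        # One small adjugate expert per group (the "LITTLE" cores in the
        # big.LITTLE analogy from the abstract).
        self.adjugates = nn.ModuleList(
            FFNExpert(d_model, d_adjugate) for _ in range(n_groups))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model); per-token loops are for clarity, not speed.
        weights, idx = self.router(x).softmax(-1).topk(self.top_k, dim=-1)
        weights = weights / weights.sum(-1, keepdim=True)
        out = torch.zeros_like(x)
        for t in range(x.size(0)):
            token = x[t]
            hit_groups = set()
            for k in range(self.top_k):
                e = idx[t, k].item()
                out[t] += weights[t, k] * self.experts[e](token)
                hit_groups.add(e // self.group_size)
            # Each adjugate expert runs at most once per token, and only if its
            # group received a routed expert -> per-token dynamic compute.
            for g in hit_groups:
                out[t] += self.adjugates[g](token)
        return out


if __name__ == "__main__":
    layer = GroveMoELayer()
    y = layer(torch.randn(5, 64))
    print(y.shape)  # torch.Size([5, 64])
```

Under these assumptions, a token whose top-k experts all fall in one group triggers a single adjugate expert, while a token whose experts span several groups triggers several, so activated parameters fluctuate within a narrow band per token. This is the kind of dynamic activation the abstract quantifies as 3.14-3.28B activated parameters for GroveMoE.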