

Grove MoE: Towards Efficient and Superior MoE LLMs with Adjugate Experts

August 11, 2025
Authors: Haoyuan Wu, Haoxing Chen, Xiaodong Chen, Zhanchao Zhou, Tieyuan Chen, Yihong Zhuang, Guoshan Lu, Zenan Huang, Junbo Zhao, Lin Liu, Zhenzhong Lan, Bei Yu, Jianguo Li
cs.AI

Abstract

The Mixture of Experts (MoE) architecture is a cornerstone of modern state-of-the-art (SOTA) large language models (LLMs). MoE models facilitate scalability by enabling sparse parameter activation. However, traditional MoE architectures use homogeneous experts of a uniform size, activating a fixed number of parameters irrespective of input complexity and thus limiting computational efficiency. To overcome this limitation, we introduce Grove MoE, a novel architecture incorporating experts of varying sizes, inspired by the heterogeneous big.LITTLE CPU architecture. This architecture features adjugate experts with a dynamic activation mechanism, enabling model capacity expansion while maintaining manageable computational overhead. Building on this architecture, we present GroveMoE-Base and GroveMoE-Inst, 33B-parameter LLMs developed by applying an upcycling strategy to the Qwen3-30B-A3B-Base model during mid-training and post-training. GroveMoE models dynamically activate 3.14-3.28B parameters based on token complexity and achieve performance comparable to SOTA open-source models of similar or even larger size.
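
The abstract does not spell out how adjugate experts are wired into a layer, but the following minimal PyTorch sketch illustrates one plausible reading: routed experts are split into groups, and each group shares a smaller adjugate expert that runs only for tokens routed to that group, so the number of activated parameters varies per token. All module names, sizes, the grouping scheme, and the way adjugate outputs are combined are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a heterogeneous MoE layer with group-shared "adjugate"
# experts (illustrative assumptions only, not the GroveMoE implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class FFN(nn.Module):
    """A simple SiLU MLP used for both the routed and the adjugate experts."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.up(x)))


class GroveStyleMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, group_size=2,
                 d_expert=128, d_adjugate=32, top_k=2):
        super().__init__()
        assert n_experts % group_size == 0
        self.top_k, self.group_size = top_k, group_size
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # large routed experts, selected per token by the router
        self.experts = nn.ModuleList(
            [FFN(d_model, d_expert) for _ in range(n_experts)])
        # one small adjugate expert shared by each group of routed experts
        self.adjugates = nn.ModuleList(
            [FFN(d_model, d_adjugate) for _ in range(n_experts // group_size)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)          # (T, E)
        weights, idx = gates.topk(self.top_k, dim=-1)      # (T, k)
        weights = weights / weights.sum(dim=-1, keepdim=True)

        out = torch.zeros_like(x)
        # standard top-k mixture over the large experts
        for e, expert in enumerate(self.experts):
            hit = (idx == e)                                # (T, k) bool
            if not hit.any():
                continue
            tok = hit.any(dim=-1)
            w = (weights * hit).sum(dim=-1)[tok].unsqueeze(-1)
            out[tok] += w * expert(x[tok])

        # each adjugate expert runs at most once per token, and only for tokens
        # routed to its group, so the activated parameter count varies per token
        groups = idx // self.group_size                     # (T, k)
        for g, adj in enumerate(self.adjugates):
            tok = (groups == g).any(dim=-1)
            if tok.any():
                out[tok] += adj(x[tok])
        return out


if __name__ == "__main__":
    layer = GroveStyleMoE()
    y = layer(torch.randn(4, 64))
    print(y.shape)  # torch.Size([4, 64])
```

In this sketch the per-token compute depends on how many distinct expert groups the router selects, which mirrors the variable 3.14-3.28B activated-parameter range reported for GroveMoE in spirit only, not in scale or mechanism.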