

Grove MoE: Towards Efficient and Superior MoE LLMs with Adjugate Experts

August 11, 2025
Authors: Haoyuan Wu, Haoxing Chen, Xiaodong Chen, Zhanchao Zhou, Tieyuan Chen, Yihong Zhuang, Guoshan Lu, Zenan Huang, Junbo Zhao, Lin Liu, Zhenzhong Lan, Bei Yu, Jianguo Li
cs.AI

Abstract

The Mixture of Experts (MoE) architecture is a cornerstone of modern state-of-the-art (SOTA) large language models (LLMs). MoE models facilitate scalability by enabling sparse parameter activation. However, traditional MoE architectures use homogeneous experts of a uniform size, activating a fixed number of parameters irrespective of input complexity and thus limiting computational efficiency. To overcome this limitation, we introduce Grove MoE, a novel architecture incorporating experts of varying sizes, inspired by the heterogeneous big.LITTLE CPU architecture. This architecture features adjugate experts with a dynamic activation mechanism, enabling model capacity expansion while maintaining manageable computational overhead. Building on this architecture, we present GroveMoE-Base and GroveMoE-Inst, 33B-parameter LLMs developed by applying an upcycling strategy to the Qwen3-30B-A3B-Base model during mid-training and post-training. GroveMoE models dynamically activate 3.14-3.28B parameters based on token complexity and achieve performance comparable to SOTA open-source models of similar or even larger size.
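
The abstract does not spell out how adjugate experts are wired into a layer, but the following minimal PyTorch sketch illustrates one plausible reading: routed experts are split into groups, and each group shares a smaller adjugate expert that runs only for tokens routed to that group, so the number of activated parameters varies per token. All module names, sizes, the grouping scheme, and the way adjugate outputs are combined are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a heterogeneous MoE layer with group-shared "adjugate"
# experts (illustrative assumptions only, not the GroveMoE implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class FFN(nn.Module):
    """A simple SiLU MLP used for both the routed and the adjugate experts."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.up(x)))


class GroveStyleMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, group_size=2,
                 d_expert=128, d_adjugate=32, top_k=2):
        super().__init__()
        assert n_experts % group_size == 0
        self.top_k, self.group_size = top_k, group_size
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # large routed experts, selected per token by the router
        self.experts = nn.ModuleList(
            [FFN(d_model, d_expert) for _ in range(n_experts)])
        # one small adjugate expert shared by each group of routed experts
        self.adjugates = nn.ModuleList(
            [FFN(d_model, d_adjugate) for _ in range(n_experts // group_size)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)          # (T, E)
        weights, idx = gates.topk(self.top_k, dim=-1)      # (T, k)
        weights = weights / weights.sum(dim=-1, keepdim=True)

        out = torch.zeros_like(x)
        # standard top-k mixture over the large experts
        for e, expert in enumerate(self.experts):
            hit = (idx == e)                                # (T, k) bool
            if not hit.any():
                continue
            tok = hit.any(dim=-1)
            w = (weights * hit).sum(dim=-1)[tok].unsqueeze(-1)
            out[tok] += w * expert(x[tok])

        # each adjugate expert runs at most once per token, and only for tokens
        # routed to its group, so the activated parameter count varies per token
        groups = idx // self.group_size                     # (T, k)
        for g, adj in enumerate(self.adjugates):
            tok = (groups == g).any(dim=-1)
            if tok.any():
                out[tok] += adj(x[tok])
        return out


if __name__ == "__main__":
    layer = GroveStyleMoE()
    y = layer(torch.randn(4, 64))
    print(y.shape)  # torch.Size([4, 64])
```

In this sketch the per-token compute depends on how many distinct expert groups the router selects, which mirrors the variable 3.14-3.28B activated-parameter range reported for GroveMoE in spirit only, not in scale or mechanism.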