Grove MoE: Towards Efficient and Superior MoE LLMs with Adjugate Experts

Abstract

The Mixture of Experts (MoE) architecture is a cornerstone of modern state-of-the-art (SOTA) large language models (LLMs). MoE models facilitate scalability by enabling sparse parameter activation. However, traditional MoE architectures use homogeneous experts of a uniform size and activate a fixed number of parameters irrespective of input complexity, which limits computational efficiency. To overcome this limitation, we introduce Grove MoE, a novel architecture incorporating experts of varying sizes, inspired by the heterogeneous big.LITTLE CPU architecture. This architecture features novel adjugate experts with a dynamic activation mechanism, enabling model capacity expansion while maintaining manageable computational overhead. Building on this architecture, we present GroveMoE-Base and GroveMoE-Inst, 33B-parameter LLMs developed by applying an upcycling strategy to the Qwen3-30B-A3B-Base model during mid-training and post-training. GroveMoE models dynamically activate 3.14-3.28B parameters based on token complexity and achieve performance comparable to SOTA open-source models of similar or even larger size.
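
The abstract does not spell out how the number of activated parameters comes to vary per token. As a rough illustration only, the sketch below assumes that routed experts are partitioned into fixed-size groups, that each group shares one small adjugate expert, and that a group's adjugate expert is evaluated once whenever at least one expert in that group is selected. The grouping scheme, the function name `activated_params`, and all sizes are hypothetical and are not taken from the text above.

```python
from typing import Sequence


def activated_params(
    selected: Sequence[int],
    group_size: int,
    expert_params: int,
    adjugate_params: int,
) -> int:
    """Count parameters touched for one token (toy model, assumed design).

    selected        -- indices of the routed experts chosen for this token
    group_size      -- number of routed experts per group (assumption)
    expert_params   -- parameters in one routed expert (hypothetical size)
    adjugate_params -- parameters in one shared adjugate expert (hypothetical size)
    """
    # Each group's adjugate expert is counted once, no matter how many
    # of that group's experts were selected.
    groups = {e // group_size for e in selected}
    return len(selected) * expert_params + len(groups) * adjugate_params


# Same number of routed experts, different group spread.
clustered = activated_params([0, 1, 2, 3], group_size=4,
                             expert_params=100, adjugate_params=10)
spread = activated_params([0, 4, 8, 12], group_size=4,
                          expert_params=100, adjugate_params=10)
print(clustered, spread)  # 410 440
```

Under these assumptions, a token whose selected experts fall into one group activates fewer parameters than a token whose experts are spread across many groups, which is one way a range such as 3.14-3.28B could arise from a fixed number of routed experts.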
