Grove MoE: Towards Efficient and Superior MoE LLMs with Adjugate Experts

Abstract

The Mixture of Experts (MoE) architecture is a cornerstone of modern state-of-the-art (SOTA) large language models (LLMs). MoE models scale by activating only a sparse subset of their parameters for each input. However, traditional MoE architectures use homogeneous experts of uniform size and activate a fixed number of parameters regardless of input complexity, which limits computational efficiency. To overcome this limitation, we introduce Grove MoE, a novel architecture that incorporates experts of varying sizes, inspired by the heterogeneous big.LITTLE CPU architecture. This architecture features novel adjugate experts with a dynamic activation mechanism, which expands model capacity while keeping computational overhead manageable. Building on this architecture, we present GroveMoE-Base and GroveMoE-Inst, 33B-parameter LLMs developed by applying an upcycling strategy to the Qwen3-30B-A3B-Base model during mid-training and post-training. GroveMoE models dynamically activate 3.14-3.28B parameters depending on token complexity and achieve performance comparable to SOTA open-source models of similar or even larger size.
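The abstract does not spell out how the dynamic activation mechanism works, so the following is only a toy sketch of one way a per-token parameter count could vary. It assumes, hypothetically, that routed experts are partitioned into fixed-size groups and that each group shares a single adjugate expert evaluated once per token whenever any member of the group is selected. The function name `activated_params`, the grouping rule, and all sizes are illustrative assumptions, not values or methods taken from the paper.

```python
def activated_params(selected, group_size, expert_params, adjugate_params):
    """Count parameters touched by one token under a hypothetical grouped layout."""
    # Each selected routed expert contributes its own parameters.
    routed = len(selected) * expert_params
    # Assumption: experts with the same (id // group_size) share one adjugate
    # expert, which is counted once per group regardless of how many of the
    # group's members were selected.
    groups_hit = {expert_id // group_size for expert_id in selected}
    return routed + len(groups_hit) * adjugate_params


# Toy numbers only: 4 selected experts, groups of 4, made-up parameter sizes.
clustered = activated_params([0, 1, 2, 3], group_size=4, expert_params=10, adjugate_params=3)
spread = activated_params([0, 4, 8, 12], group_size=4, expert_params=10, adjugate_params=3)
print(clustered, spread)  # prints: 43 52
```

Under these assumptions, a token whose selected experts fall in one group touches fewer parameters than a token whose selections are spread across groups, which is the kind of range the reported 3.14-3.28B figure describes.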
