Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts

Abstract

Mixture-of-Experts (MoE) has become the dominant architecture for scaling large language models: frontier models routinely decouple total parameters from per-token computation through sparse expert routing. Scaling laws show that under fixed active computation, model quality scales predictably with total parameters, and MoEs realize this by increasing expert count. However, training large MoEs is expensive, as memory requirements and inter-device communication both scale with total parameter count. We propose expert upcycling, a method for progressively expanding MoE capacity by increasing the number of experts during continued pre-training (CPT). Given a trained E-expert model, the upcycling operator constructs an mE-expert model through expert duplication and router extension while holding top-K routing fixed, preserving per-token inference cost. Duplication provides a warm initialization: the expanded model inherits the source checkpoint's learned representations, starting from a substantially lower loss than random initialization. Subsequent CPT then breaks the symmetry among duplicated experts to drive specialization. We formalize the upcycling operator and develop a theoretical framework decomposing the quality gap into a capacity term and an initialization term. We further introduce utility-based expert selection, which uses gradient-based importance scores to guide non-uniform duplication, more than tripling gap closure when CPT is limited. In our experiments at 7B-13B total parameters, the upcycled model matches the fixed-size baseline on validation loss while saving 32% of GPU hours. Comprehensive ablations across model scales, activation ratios, MoE architectures, and training budgets yield a practical recipe for deploying expert upcycling, establishing it as a principled, compute-efficient alternative to training large MoE models from scratch.
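
As a rough illustration of the duplication step described above, the following is a minimal sketch and not the authors' implementation. The function names `upcycle_uniform` and `active_experts`, the array layouts, and the use of NumPy are all assumptions made for this example. It covers only uniform duplication (every expert copied m times); the utility-based, non-uniform selection is not shown.

```python
import numpy as np


def upcycle_uniform(expert_w, router_w, m):
    """Copy every expert and its router row m times (uniform duplication).

    expert_w: (E, d_model, d_hidden) stacked expert weights (assumed layout)
    router_w: (E, d_model), one router row per expert (assumed layout)
    """
    new_expert_w = np.tile(expert_w, (m, 1, 1))  # (m*E, d_model, d_hidden)
    new_router_w = np.tile(router_w, (m, 1))     # (m*E, d_model)
    return new_expert_w, new_router_w


def active_experts(router_w, x, k):
    """Return the indices of the k experts with the largest router logits."""
    logits = router_w @ x
    return np.argsort(logits)[-k:]


rng = np.random.default_rng(0)
E, d_model, d_hidden, K, m = 4, 8, 16, 2, 2
expert_w = rng.normal(size=(E, d_model, d_hidden))
router_w = rng.normal(size=(E, d_model))

big_expert_w, big_router_w = upcycle_uniform(expert_w, router_w, m)
x = rng.normal(size=d_model)

print(big_expert_w.shape)                       # (8, 8, 16)
print(len(active_experts(big_router_w, x, K)))  # 2
```

In this sketch the total expert count grows from E to mE while each token still activates K experts. Right after duplication the copies are identical and their router logits tie exactly, which is the symmetry that subsequent CPT is expected to break.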
