Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts

Abstract

Mixture-of-Experts (MoE) has become the dominant architecture for scaling large language models: frontier models routinely decouple total parameter count from per-token computation through sparse expert routing. Scaling laws show that, at fixed active computation, model quality improves predictably with total parameters, and MoEs realize this by increasing the expert count. However, training large MoEs is expensive, because memory requirements and inter-device communication both scale with total parameter count. We propose **expert upcycling**, a method for progressively expanding MoE capacity by increasing the number of experts during continued pre-training (CPT). Given a trained E-expert model, the upcycling operator constructs an mE-expert model through expert duplication and router extension while holding top-K routing fixed, which preserves per-token inference cost. Duplication provides a warm initialization: the expanded model inherits the source checkpoint's learned representations and therefore starts from a substantially lower loss than random initialization. Subsequent CPT then breaks the symmetry among duplicated experts and drives specialization. We formalize the upcycling operator and develop a theoretical framework that decomposes the quality gap into a capacity term and an initialization term. We further introduce **utility-based expert selection**, which uses gradient-based importance scores to guide non-uniform duplication and more than triples gap closure when CPT is limited. In our 7B-13B total-parameter experiments, the upcycled model matches the fixed-size baseline on validation loss while saving 32% of GPU hours. Comprehensive ablations across model scales, activation ratios, MoE architectures, and training budgets yield a practical recipe for deploying expert upcycling, establishing it as a principled, compute-efficient alternative to training large MoE models from scratch.
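
To make the duplication-and-router-extension step concrete, the following is a minimal sketch of what such an operator might look like, not the authors' implementation. The function names (`upcycle_experts`, `top_k_experts`), the list-based weight layout, the rule that every source expert is kept at least once, and the optional router noise used to separate duplicates are all assumptions introduced for illustration. The abstract states only that experts are duplicated, the router is extended, and top-K routing stays fixed.

```python
import copy
import random


def upcycle_experts(experts, router_rows, copies, noise_std=0.0, seed=0):
    """Expand an E-expert layer by duplicating experts and extending the router.

    experts:     list of E expert parameter containers (any deep-copyable object).
    router_rows: list of E lists, one router weight row per expert.
    copies:      list of E integers; copies[i] is how many times expert i
                 appears in the expanded layer (all equal to m for uniform
                 duplication, unequal for non-uniform duplication).
    noise_std:   hypothetical knob: std of Gaussian noise added to the copied
                 router rows so duplicates do not start exactly tied.
    """
    if not (len(experts) == len(router_rows) == len(copies)):
        raise ValueError("experts, router_rows and copies must have equal length")
    if any(c < 1 for c in copies):
        # Assumption of this sketch: no source expert is dropped.
        raise ValueError("every expert must be kept at least once")

    rng = random.Random(seed)
    new_experts, new_router, source = [], [], []
    for idx, n_copies in enumerate(copies):
        for _ in range(n_copies):
            # Each duplicate gets its own parameters so it can diverge later.
            new_experts.append(copy.deepcopy(experts[idx]))
            row = list(router_rows[idx])
            if noise_std > 0:
                row = [w + rng.gauss(0.0, noise_std) for w in row]
            new_router.append(row)
            source.append(idx)  # remember which source expert this came from
    return new_experts, new_router, source


def top_k_experts(router_rows, token, k):
    """Return the indices of the k experts with the highest router logits."""
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_rows]
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return ranked[:k]


# Toy example: E = 2 experts, uniform duplication with m = 2.
experts = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
router = [[0.5, -0.5], [-0.2, 0.9]]
new_experts, new_router, source = upcycle_experts(experts, router, copies=[2, 2])
print(len(new_experts), source)  # 4 [0, 0, 1, 1]

# K is unchanged, so each token still activates the same number of experts.
print(len(top_k_experts(new_router, [1.0, 0.0], k=1)))  # 1
```

Passing unequal `copies` (for example `[3, 1]`) gives a non-uniform expansion. How those counts would be derived from gradient-based importance scores is not specified in the abstract, so the sketch leaves them as an input.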

