Converting Mixture-of-Experts Models into Dense Language Models via Pruning and Distillation

Abstract

Mixture-of-Experts (MoE) is now the dominant architecture for frontier language models, yet it requires all expert parameters to be loaded in memory, which makes it poorly suited to memory-constrained deployment. Existing compression methods reduce the number of experts, but their output is still an MoE model with the same fundamental limitation. We present the first systematic framework for converting a trained MoE into a standard fully dense architecture: experts are scored, selected, and grouped, then concatenated into a dense FFN and refined by knowledge distillation from the MoE teacher. We evaluate 7 scoring, 5 grouping, and 2 magnitude scaling methods across a range of selected expert counts on Qwen3-30B-A3B, for a total of 350 configurations. We find that the choice of scoring method has the largest impact, and that our novel diversity-aware scoring consistently outperforms prior methods on Qwen3-30B-A3B, DeepSeek-V2-Lite, and GPT-OSS-20B. In a controlled comparison at matched parameter count, MoE-to-dense conversion outperforms dense-to-dense pruning by +6.3 pp in average downstream accuracy after ~4B tokens of distillation, while training 1.6x faster in wall-clock time.
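
To make the concatenation step concrete, the following is a minimal NumPy sketch, not the implementation evaluated here. It assumes each expert is a gated (SwiGLU-style) FFN with weights `w_gate`, `w_up`, and `w_down`, and that the experts to keep have already been chosen; scoring, selection, and grouping are not shown. The helper names (`expert_ffn`, `concat_experts`) and the per-expert `scales` argument are hypothetical, with `scales` standing in only loosely for a magnitude scaling step. Under these assumptions, stacking the selected experts along the hidden dimension gives a single dense FFN whose output equals the scaled sum of the individual expert outputs.

```python
import numpy as np


def silu(x):
    # SiLU activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))


def expert_ffn(x, w_gate, w_up, w_down):
    # Gated FFN (assumed expert form).
    # x: (tokens, d_model); w_gate, w_up: (d_model, d_ff); w_down: (d_ff, d_model)
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down


def concat_experts(experts, scales=None):
    # Hypothetical helper: merge k experts into one dense FFN of width k * d_ff.
    # `scales` is an illustrative per-expert multiplier applied to w_down.
    if scales is None:
        scales = [1.0] * len(experts)
    w_gate = np.concatenate([e[0] for e in experts], axis=1)
    w_up = np.concatenate([e[1] for e in experts], axis=1)
    w_down = np.concatenate([s * e[2] for s, e in zip(scales, experts)], axis=0)
    return w_gate, w_up, w_down


# Toy dimensions chosen for illustration only.
rng = np.random.default_rng(0)
d_model, d_ff, k = 8, 16, 3
experts = [
    (
        rng.normal(size=(d_model, d_ff)),
        rng.normal(size=(d_model, d_ff)),
        rng.normal(size=(d_ff, d_model)),
    )
    for _ in range(k)
]
x = rng.normal(size=(4, d_model))

dense = concat_experts(experts)
dense_out = expert_ffn(x, *dense)

# Reference: unweighted sum of the individual expert outputs.
summed = sum(expert_ffn(x, *e) for e in experts)
```

The equivalence holds because the gate and up projections act independently on each block of hidden units, and the stacked down projection adds the blocks back together. The dense model has no router, so it cannot reproduce the MoE's input-dependent expert weighting; this is consistent with refining the concatenated FFN by distillation from the MoE teacher.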