Post-Trained Mixture-of-Experts Models Can Skip Half the Experts via Self-Distillation

Abstract

Mixture-of-Experts (MoE) scales language models efficiently through sparse expert activation, and its dynamic variants further reduce computation by adjusting the set of activated experts in an input-dependent manner. Existing dynamic MoE methods usually rely on pre-training from scratch or task-specific adaptation, leaving the practical conversion of fully trained MoE models underexplored. Enabling such conversion would directly reduce inference cost by allowing easy tokens to bypass unnecessary experts during serving. This paper introduces Zero-Expert Self-Distillation Adaptation (ZEDA), a low-cost framework that transforms post-trained static MoE models into efficient dynamic ones. To stabilize this architectural conversion, ZEDA injects parameter-free zero-output experts into each MoE layer and adapts the augmented model through two-stage self-distillation, using the original MoE as a frozen teacher and applying a group-level balancing loss. On Qwen3-30B-A3B and GLM-4.7-Flash, across 11 benchmarks spanning math, code, and instruction following, ZEDA eliminates over 50% of expert FLOPs with only marginal accuracy loss. It outperforms the strongest dynamic MoE baseline by 6.1 and 4.0 points on the two models, respectively, and delivers a ~1.20× end-to-end inference speedup.
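
The zero-output expert idea can be illustrated with a toy routing function. This is only a sketch under stated assumptions, not the authors' implementation: the abstract does not say how router scores are normalized or how zero experts are indexed, so the softmax over all slots, the plain top-k selection, the convention that slots past the real experts are zero experts, and every name here (`moe_forward_with_zero_experts`, `real_experts`, `top_k`) are hypothetical. The two-stage self-distillation and the group-level balancing loss are not modeled.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Vector], Vector]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward_with_zero_experts(
    x: Vector,
    router_logits: Sequence[float],
    real_experts: Sequence[Expert],
    top_k: int,
) -> Tuple[Vector, int]:
    """Route one token through a toy MoE layer.

    Assumption: router slots with index >= len(real_experts) are
    parameter-free zero-output experts.
    Returns the layer output and the number of real experts executed.
    """
    probs = softmax(router_logits)
    # Pick the top_k slots by router probability (ties go to the lower index).
    chosen = sorted(range(len(probs)), key=lambda i: (-probs[i], i))[:top_k]
    out = [0.0] * len(x)
    executed = 0
    for i in chosen:
        if i >= len(real_experts):
            # Zero-output expert: adds nothing, so no expert computation runs.
            continue
        y = real_experts[i](x)
        out = [o + probs[i] * v for o, v in zip(out, y)]
        executed += 1
    return out, executed


# Toy setup: 4 real experts that scale the input, plus 2 zero-expert slots.
experts = [lambda v, s=s: [s * t for t in v] for s in (1.0, 2.0, 3.0, 4.0)]
token = [1.0, -1.0]

# An "easy" token whose router prefers the two zero slots (indices 4 and 5).
easy_out, easy_executed = moe_forward_with_zero_experts(
    token, [0.0, 0.0, 0.0, 0.0, 5.0, 5.0], experts, top_k=2
)
# A "hard" token whose router prefers real experts 1 and 3.
hard_out, hard_executed = moe_forward_with_zero_experts(
    token, [0.0, 5.0, 0.0, 5.0, 0.0, 0.0], experts, top_k=2
)
```

In this toy setup the easy token runs no real expert and the hard token runs two, which is the mechanism by which selecting zero experts would remove expert FLOPs. In a standard transformer block the residual connection would still carry the token's hidden state past the layer when the MoE contribution is zero.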