Post-Trained MoE Can Skip Half of Its Experts via Self-Distillation

Abstract

Mixture-of-Experts (MoE) scales language models efficiently through sparse expert activation, and its dynamic variant further reduces computation by adjusting the number of activated experts in an input-dependent manner. Existing dynamic MoE methods usually rely on pre-training from scratch or task-specific adaptation, leaving the practical conversion of fully trained MoE models underexplored. Enabling such conversion would directly reduce inference costs by allowing easy tokens to bypass unnecessary experts during serving. This paper introduces Zero-Expert Self-Distillation Adaptation (ZEDA), a low-cost framework that transforms post-trained static MoE models into efficient dynamic ones. To stabilize this architectural conversion, ZEDA injects parameter-free zero-output experts into each MoE layer and adapts the augmented model through two-stage self-distillation, using the original MoE as a frozen teacher and applying a group-level balancing loss. On Qwen3-30B-A3B and GLM-4.7-Flash, across 11 benchmarks spanning math, code, and instruction following, ZEDA eliminates over 50% of expert FLOPs with marginal accuracy loss. It outperforms the strongest dynamic MoE baseline by 6.1 and 4.0 points on the two models, respectively, and delivers an end-to-end inference speedup of about 1.20×.
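
To make the zero-expert idea concrete, the following is a minimal sketch of top-k routing over a pool that mixes real experts with parameter-free zero-output experts. It is an illustration under stated assumptions, not ZEDA's implementation: the function names (`route_with_zero_experts`, `moe_output`), the softmax over all router logits, and the use of unrenormalized gate weights are all hypothetical choices that the abstract does not specify. The point it shows is only that a routing slot assigned to a zero expert contributes nothing to the output and therefore needs no expert computation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_with_zero_experts(logits, num_real, top_k):
    """Select top_k experts; indices >= num_real denote zero-output experts.

    Returns (real_selected, skipped): the (index, gate_weight) pairs that
    must actually be computed, and the number of slots taken by zero experts.
    Assumption: gate weights are plain softmax probabilities, not renormalized.
    """
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    real_selected = [(i, probs[i]) for i in chosen if i < num_real]
    skipped = top_k - len(real_selected)
    return real_selected, skipped


def moe_output(x, logits, real_experts, top_k):
    """Weighted sum of the selected real experts; zero experts add nothing."""
    real_selected, skipped = route_with_zero_experts(
        logits, len(real_experts), top_k
    )
    out = [0.0] * len(x)
    for idx, weight in real_selected:
        y = real_experts[idx](x)  # only real experts cost FLOPs
        out = [o + weight * v for o, v in zip(out, y)]
    return out, skipped


# Hypothetical toy setup: 4 real experts plus 2 zero experts, top-2 routing.
real_experts = [
    lambda x: [1.0 * v for v in x],
    lambda x: [2.0 * v for v in x],
    lambda x: [3.0 * v for v in x],
    lambda x: [4.0 * v for v in x],
]
easy_logits = [0.1, 2.0, 0.3, 0.0, 3.0, -1.0]   # a zero expert ranks first
hard_logits = [2.5, 2.0, 0.3, 0.0, -3.0, -1.0]  # both slots go to real experts

easy_out, easy_skipped = moe_output([1.0, 2.0], easy_logits, real_experts, 2)
hard_out, hard_skipped = moe_output([1.0, 2.0], hard_logits, real_experts, 2)
```

In this toy setup the "easy" token spends one of its two routing slots on a zero expert and so computes a single real expert, while the "hard" token computes two. How the router learns to make that choice is the role of the self-distillation and balancing loss described above, which this sketch does not model.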