

Temporally Extended Mixture-of-Experts Models

April 22, 2026
Authors: Zeyu Shen, Peter Henderson
cs.AI

Abstract

Mixture-of-Experts models, now popular for scaling capacity at fixed inference speed, switch experts at nearly every token. Once a model outgrows available GPU memory, this churn can render optimizations like offloading and pre-fetching ineffective. We make the case that the options framework in reinforcement learning is a perfect match to tackle this problem, and argue for temporally extended mixture-of-experts layers. Building on the option-critic framework with deliberation costs, we add a controller to each layer that learns when to switch expert sets and which to load. By applying this to gpt-oss-20b with low-rank adapters and a self-distillation reward, our method reduces switch rates from over 50% to below 5% while retaining up to 90% of base-model accuracy on MATH, MMLU, and MMMLU. This shows that even existing pre-trained models can be converted to temporally extended MoEs with lightweight training, with the deliberation cost allowing model trainers to trade off switching rates against capability. We hope this opens a principled path, grounded in the options framework, for memory-efficient serving and continual learning in ever-growing MoE models.
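The abstract describes a per-layer controller that learns when to terminate the currently loaded expert set and which experts to load next, with a deliberation cost discouraging frequent switches. Below is a minimal, hypothetical PyTorch-style sketch of that idea; the module name, the top-k set selection, the 0.5 termination threshold, and returning the deliberation cost as an auxiliary penalty are illustrative assumptions, not the paper's implementation (which trains the controller with an option-critic-style objective and a self-distillation reward on gpt-oss-20b with low-rank adapters).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporallyExtendedMoELayer(nn.Module):
    """Sketch of an MoE layer whose controller keeps the loaded expert set
    across tokens and only pays a deliberation cost when it chooses to switch."""

    def __init__(self, d_model, num_experts, experts_per_set, deliberation_cost=0.01):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        )
        self.router = nn.Linear(d_model, num_experts)  # scores experts for a new set
        self.terminator = nn.Linear(d_model, 1)        # termination head for the current "option"
        self.experts_per_set = experts_per_set
        self.deliberation_cost = deliberation_cost

    def forward(self, x):
        # x: (seq_len, d_model), processed token by token for clarity.
        outputs = []
        switch_penalty = x.new_zeros(())
        active_set = None  # indices of the currently loaded experts
        for t in range(x.size(0)):
            h = x[t]
            # Termination probability: should the current expert set end here?
            beta = torch.sigmoid(self.terminator(h))
            if active_set is None or beta.item() > 0.5:
                # Switch: pick a new top-k expert set and incur the deliberation cost.
                scores = self.router(h)
                active_set = scores.topk(self.experts_per_set).indices.tolist()
                switch_penalty = switch_penalty + self.deliberation_cost
            # Mix the loaded experts with router weights restricted to the active set.
            weights = F.softmax(self.router(h)[active_set], dim=-1)
            y = sum(w * self.experts[i](h) for w, i in zip(weights, active_set))
            outputs.append(y)
        # switch_penalty would be folded into the training objective (e.g. an RL reward),
        # trading off switch rate against capability as the abstract describes.
        return torch.stack(outputs), switch_penalty
```

In this sketch, lowering `deliberation_cost` lets the controller switch more freely (closer to standard per-token routing), while raising it pushes the layer toward long stretches on one expert set, which is what makes offloading and pre-fetching practical when the experts do not fit in GPU memory.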