Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks
August 26, 2025
Authors: Taishi Nakamura, Satoki Ishikawa, Masaki Kawamura, Takumi Okamoto, Daisuke Nohara, Jun Suzuki, Rio Yokota
cs.AI
Abstract
Empirical scaling laws have driven the evolution of large language models
(LLMs), yet their coefficients shift whenever the model architecture or data
pipeline changes. Mixture-of-Experts (MoE) models, now standard in
state-of-the-art systems, introduce a new sparsity dimension that current
dense-model frontiers overlook. We investigate how MoE sparsity influences two
distinct capability regimes: memorization and reasoning. We train families of
MoE Transformers that systematically vary total parameters, active parameters,
and top-k routing while holding the compute budget fixed. For every model we
record pre-training loss, downstream task loss, and task accuracy, allowing us
to separate the train-test generalization gap from the loss-accuracy gap.
Memorization benchmarks improve monotonically with total parameters, mirroring
training loss. By contrast, reasoning performance saturates and can even
regress despite continued gains in both total parameters and training loss.
Altering top-k alone has little effect when active parameters are constant,
and classic hyperparameters such as learning rate and initialization modulate
the generalization gap in the same direction as sparsity. Neither post-training
reinforcement learning (GRPO) nor extra test-time compute rescues the reasoning
deficit of overly sparse models. Our model checkpoints, code, and logs are
open-sourced at https://github.com/rioyokotalab/optimal-sparsity.
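To make the fixed-compute comparison in the abstract concrete, below is a minimal sketch, not the authors' code: the layer sizes, expert counts, and the 6 x N_active x D training-FLOPs estimate are illustrative assumptions (the last is a common rule of thumb, applied here to activated parameters only). It shows how two MoE configurations can share the same active parameters, and hence roughly the same compute per token, while differing in total parameters, i.e. in sparsity, which is the knob the paper varies.

```python
# Minimal sketch (not the authors' code): bookkeeping for MoE sparsity under a
# fixed training-compute budget. All configuration values are hypothetical.

def moe_param_counts(d_model, n_layers, d_ff, n_experts, top_k, vocab=32000):
    """Rough parameter counts for a decoder-only MoE Transformer.

    Attention is dense; only the FFN is expert-parallel, so the total/active
    split comes entirely from routing top_k of n_experts experts per token.
    """
    attn = 4 * d_model * d_model * n_layers          # Q, K, V, O projections
    ffn_per_expert = 2 * d_model * d_ff * n_layers   # up + down projections
    embed = vocab * d_model
    total = attn + ffn_per_expert * n_experts + embed
    active = attn + ffn_per_expert * top_k + embed
    return total, active

def train_flops(active_params, tokens):
    # ~6 * N_active * D rule of thumb: only activated parameters enter the
    # forward/backward matmuls of an MoE, so compute tracks active params.
    return 6 * active_params * tokens

# Two hypothetical models with the same active parameters (same top_k) but
# different total parameters, i.e. different sparsity.
for n_experts in (8, 64):
    total, active = moe_param_counts(d_model=1024, n_layers=16, d_ff=4096,
                                     n_experts=n_experts, top_k=2)
    print(f"experts={n_experts:3d}  total={total / 1e9:.2f}B  "
          f"active={active / 1e9:.2f}B  sparsity={total / active:.1f}x  "
          f"train FLOPs @100B tokens={train_flops(active, 100e9):.2e}")
```

Because both configurations activate the same number of parameters per token, their estimated training FLOPs coincide; only total parameters, and therefore sparsity, differ, mirroring the controlled comparison described in the abstract.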