Routing Manifold Alignment Improves Generalization of Mixture-of-Experts LLMs
November 10, 2025
Authors: Zhongyang Li, Ziyue Li, Tianyi Zhou
cs.AI
Abstract
Sparse Mixture-of-Experts (MoE) has been widely adopted in recent large language models because it efficiently scales up model capacity without increasing inference cost. However, evaluations on a broad range of downstream tasks reveal a consistent suboptimality of the routers in existing MoE LLMs, which results in a severe performance gap (e.g., 10-20% in accuracy) relative to the optimal routing. In this paper, we show that aligning the manifold of routing weights with that of task embeddings can effectively reduce this gap and improve MoE LLMs' generalization performance. Our method, "Routing Manifold Alignment (RoMA)", introduces an additional manifold-regularization term in the post-training objective and requires only lightweight finetuning of the routers (with all other parameters frozen). Specifically, the regularization encourages the routing weights of each sample to be close to those of its successful neighbors (i.e., samples whose routing weights lead to correct answers) in a task-embedding space. Consequently, samples targeting similar tasks share similar expert choices across layers. Building such bindings between tasks and experts across different samples is essential for better generalization. Moreover, RoMA demonstrates the advantage of unifying task understanding (by embedding models) with solution generation (by MoE LLMs). In experiments, we finetune the routers of OLMoE, DeepSeekMoE, and Qwen3-MoE using RoMA. Evaluations on diverse benchmarks and extensive comparisons with baselines show the substantial improvements brought by RoMA.
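The abstract does not spell out the exact form of the regularizer. Below is a minimal, hypothetical PyTorch sketch of one way such a manifold-regularization term could look, assuming an L2 penalty that pulls each sample's routing weights toward those of its k successful nearest neighbors under cosine similarity of task embeddings; all names (`roma_regularizer`, `routing_weights`, `task_embeddings`, `successful`, `k`, `lambda_reg`) are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of the manifold-regularization idea described in the abstract.
# All tensor names and hyperparameters below are assumptions for illustration only.
import torch

def roma_regularizer(routing_weights, task_embeddings, successful, k=8, lambda_reg=0.1):
    """Encourage each sample's routing weights to stay close to those of its
    successful nearest neighbors in a task-embedding space.

    routing_weights: (N, L*E) flattened per-layer router outputs for N samples
    task_embeddings: (N, D) embeddings of the same samples from an embedding model
    successful:      (N,) boolean mask of samples whose routing led to correct answers
    """
    # Pairwise cosine similarity in the task-embedding space.
    emb = torch.nn.functional.normalize(task_embeddings, dim=-1)
    sim = emb @ emb.T                                   # (N, N)
    sim.fill_diagonal_(-float("inf"))                   # exclude each sample itself
    sim[:, ~successful] = -float("inf")                 # neighbors must be successful

    # Indices of the k most similar successful samples for each sample.
    # (Assumes at least k successful samples exist; a sketch, not production code.)
    nn_idx = sim.topk(k, dim=-1).indices                # (N, k)

    # L2 penalty pulling each sample's routing weights toward its neighbors'.
    neighbor_w = routing_weights[nn_idx].detach()       # (N, k, L*E), no grad to neighbors
    reg = ((routing_weights.unsqueeze(1) - neighbor_w) ** 2).sum(dim=-1).mean()
    return lambda_reg * reg
```

In training, a term of this kind would presumably be added to the standard post-training loss while only the router parameters are updated, consistent with the lightweight finetuning described above.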