Routing Manifold Alignment Improves Generalization of Mixture-of-Experts LLMs

November 10, 2025
Authors: Zhongyang Li, Ziyue Li, Tianyi Zhou
cs.AI

Abstract

Sparse Mixture-of-Experts (MoE) has been widely adopted in recent large language models because it efficiently scales up model capacity without increasing inference cost. However, evaluations on broad downstream tasks reveal consistently suboptimal routers in existing MoE LLMs, resulting in a severe performance gap (e.g., 10-20% in accuracy) relative to optimal routing. In this paper, we show that aligning the manifold of routing weights with that of task embeddings can effectively reduce this gap and improve MoE LLMs' generalization performance. Our method, "Routing Manifold Alignment (RoMA)", introduces an additional manifold regularization term into the post-training objective and requires only lightweight finetuning of the routers (with all other parameters frozen). Specifically, the regularization encourages the routing weights of each sample to be close to those of its successful neighbors (samples whose routing weights lead to correct answers) in a task embedding space. Consequently, samples targeting similar tasks share similar expert choices across layers. Building such bindings between tasks and experts across different samples is essential for better generalization. Moreover, RoMA demonstrates the advantage of unifying task understanding (via embedding models) with solution generation (via MoE LLMs). In experiments, we finetune the routers of OLMoE, DeepSeekMoE, and Qwen3-MoE using RoMA. Evaluations on diverse benchmarks and extensive comparisons with baselines show the substantial improvements brought by RoMA.
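The abstract describes the manifold regularizer only in words; the following is a minimal PyTorch-style sketch of what such a term could look like, not the authors' released implementation. All names, tensor shapes, and hyperparameters (routing_weights, task_emb, success_mask, k, tau, lambda_reg) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def routing_manifold_loss(routing_weights, task_emb, success_mask, k=8, tau=0.1):
    """Hypothetical sketch: pull each sample's routing weights toward those of
    its successful neighbors in a task-embedding space.

    routing_weights: [B, L, E] router outputs per layer for a batch of samples
    task_emb:        [B, D]    task embeddings from a separate embedding model
    success_mask:    [B]       1 if the sample's current routing yields a correct answer
    Assumes each batch contains at least k successful samples.
    """
    B, L, E = routing_weights.shape
    r = routing_weights.reshape(B, L * E)      # flatten per-sample routing weights
    z = F.normalize(task_emb, dim=-1)          # cosine geometry on the task manifold

    # Similarity of every sample to every *successful* sample (self excluded).
    sim = z @ z.t() / tau                                           # [B, B]
    sim = sim.masked_fill(~success_mask.bool().unsqueeze(0), float("-inf"))
    sim.fill_diagonal_(float("-inf"))

    # Soft weights over the k most similar successful neighbors.
    topk_sim, topk_idx = sim.topk(k, dim=-1)                        # [B, k]
    w = torch.softmax(topk_sim, dim=-1)                             # [B, k]

    # Squared distance between each sample's routing weights and its neighbors'.
    neighbor_r = r[topk_idx]                                        # [B, k, L*E]
    sq_dist = ((r.unsqueeze(1) - neighbor_r) ** 2).sum(-1)          # [B, k]
    return (w * sq_dist).sum(-1).mean()

# Assumed post-training objective, with only router parameters trainable:
# loss = lm_loss + lambda_reg * routing_manifold_loss(routing_weights, task_emb, success_mask)
```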