Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss

December 29, 2025
Authors: Ang Lv, Jin Ma, Yiyuan Ma, Siyuan Qiao
cs.AI

Abstract

Mixture-of-Experts (MoE) models lack explicit constraints to ensure the router's decisions align well with the experts' capabilities, which ultimately limits model performance. To address this, we propose expert-router coupling (ERC) loss, a lightweight auxiliary loss that tightly couples the router's decisions with expert capabilities. Our approach treats each expert's router embedding as a proxy token for the tokens assigned to that expert, and feeds perturbed router embeddings through the experts to obtain internal activations. The ERC loss enforces two constraints on these activations: (1) Each expert must exhibit higher activation for its own proxy token than for the proxy tokens of any other expert. (2) Each proxy token must elicit stronger activation from its corresponding expert than from any other expert. These constraints jointly ensure that each router embedding faithfully represents its corresponding expert's capability, while each expert specializes in processing the tokens actually routed to it. The ERC loss is computationally efficient, operating only on n^2 activations, where n is the number of experts. This represents a fixed cost independent of batch size, unlike prior coupling methods that scale with the number of tokens (often millions per batch). Through pre-training MoE-LLMs ranging from 3B to 15B parameters and extensive analysis on trillions of tokens, we demonstrate the effectiveness of the ERC loss. Moreover, the ERC loss offers flexible control and quantitative tracking of expert specialization levels during training, providing valuable insights into MoEs.
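
The two constraints described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not give the exact loss formula, so the hinge-style penalty, the margin factor `alpha`, the multiplicative form of the perturbation, the use of an activation norm as the scalar "activation", and the names `activation_matrix` and `erc_loss` are all assumptions made for illustration. What the sketch does show is why the cost is fixed: the loss only ever touches an n x n matrix, regardless of how many tokens are in the batch.

```python
import numpy as np


def activation_matrix(router_emb, expert_w, noise_scale=0.0, rng=None):
    """Return M with M[i, j] = activation norm of expert j on proxy token i.

    router_emb: (n, d) router embeddings, one per expert (used as proxy tokens).
    expert_w:   (n, d, h) one input projection per expert (hypothetical layout).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    # Assumed perturbation: bounded multiplicative noise around 1.
    noise = 1.0 + noise_scale * rng.uniform(-1.0, 1.0, size=router_emb.shape)
    proxies = router_emb * noise  # (n, d)
    # hidden[i, j, :] = proxies[i] @ expert_w[j]
    hidden = np.einsum("id,jdh->ijh", proxies, expert_w)  # (n, n, h)
    return np.linalg.norm(hidden, axis=-1)  # (n, n)


def erc_loss(M, alpha=1.0):
    """Hinge-style penalty on the n x n activation matrix (assumed form).

    Proxy-token constraint: proxy i activates its own expert most,
        M[i, j] <= alpha * M[i, i] for j != i.
    Expert constraint: expert j responds most to its own proxy,
        M[i, j] <= alpha * M[j, j] for i != j.
    """
    n = M.shape[0]
    diag = np.diag(M)
    proxy_violation = np.maximum(M - alpha * diag[:, None], 0.0)
    expert_violation = np.maximum(M - alpha * diag[None, :], 0.0)
    off_diag = ~np.eye(n, dtype=bool)
    total = proxy_violation[off_diag].sum() + expert_violation[off_diag].sum()
    return total / (n * n)


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    n, d, h = 4, 8, 16  # toy sizes, chosen arbitrarily
    R = rng.normal(size=(n, d))
    W = rng.normal(size=(n, d, h))
    M = activation_matrix(R, W, noise_scale=0.1, rng=rng)
    print("activation matrix shape:", M.shape)
    print("ERC-style loss:", erc_loss(M))
```

In this sketch the loss is zero exactly when every diagonal entry of `M` dominates both its row and its column (scaled by `alpha`), which corresponds to the two constraints holding simultaneously. In a real training setup the same computation would be written in an autodiff framework so gradients reach both the router embeddings and the expert weights.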