Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss

December 29, 2025
Authors: Ang Lv, Jin Ma, Yiyuan Ma, Siyuan Qiao
cs.AI

Abstract

Mixture-of-Experts (MoE) models lack explicit constraints to ensure the router's decisions align well with the experts' capabilities, which ultimately limits model performance. To address this, we propose expert-router coupling (ERC) loss, a lightweight auxiliary loss that tightly couples the router's decisions with expert capabilities. Our approach treats each expert's router embedding as a proxy token for the tokens assigned to that expert, and feeds perturbed router embeddings through the experts to obtain internal activations. The ERC loss enforces two constraints on these activations: (1) Each expert must exhibit higher activation for its own proxy token than for the proxy tokens of any other expert. (2) Each proxy token must elicit stronger activation from its corresponding expert than from any other expert. These constraints jointly ensure that each router embedding faithfully represents its corresponding expert's capability, while each expert specializes in processing the tokens actually routed to it. The ERC loss is computationally efficient, operating only on n^2 activations, where n is the number of experts. This represents a fixed cost independent of batch size, unlike prior coupling methods that scale with the number of tokens (often millions per batch). Through pre-training MoE-LLMs ranging from 3B to 15B parameters and extensive analysis on trillions of tokens, we demonstrate the effectiveness of the ERC loss. Moreover, the ERC loss offers flexible control and quantitative tracking of expert specialization levels during training, providing valuable insights into MoEs.
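
The two constraints can be illustrated with a minimal NumPy sketch. The abstract does not give the exact formulation, so everything concrete here is an assumption made for illustration: the function names `activation_matrix` and `erc_style_loss`, the Gaussian perturbation, the use of the L2 norm of a single input projection as the "activation strength", and the hinge penalty with a `margin` parameter. It should be read as one plausible reading of the description, not as the authors' implementation.

```python
import numpy as np


def activation_matrix(router_emb, expert_w_in, noise_std=0.01, rng=None):
    """Return M with M[i, j] = activation strength of expert i on proxy token j.

    router_emb:  (n, d) array, one router embedding per expert (the proxy tokens).
    expert_w_in: (n, d, h) array, a hypothetical input projection per expert.
    The Gaussian perturbation and the L2-norm measure are assumptions.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    # Perturb each router embedding to obtain the proxy tokens.
    proxies = router_emb + noise_std * rng.standard_normal(router_emb.shape)
    # Feed every proxy token j through every expert i: shape (n, n, h).
    hidden = np.einsum("jd,idh->ijh", proxies, expert_w_in)
    # Summarize each hidden vector by its norm: shape (n, n).
    return np.linalg.norm(hidden, axis=-1)


def erc_style_loss(M, margin=0.0):
    """Hinge penalty on violations of the two constraints (assumed form)."""
    n = M.shape[0]
    diag = np.diag(M)
    off_diag = ~np.eye(n, dtype=bool)
    # Constraint (1): expert i responds most to its own proxy, M[i, i] > M[i, j].
    row_violation = np.maximum(M - diag[:, None] + margin, 0.0)
    # Constraint (2): proxy j is answered most by its own expert, M[j, j] > M[i, j].
    col_violation = np.maximum(M - diag[None, :] + margin, 0.0)
    total = row_violation[off_diag].sum() + col_violation[off_diag].sum()
    return total / (n * (n - 1))


# Example: 4 experts, embedding size 8, hidden size 16 (arbitrary values).
rng = np.random.default_rng(1)
M = activation_matrix(rng.standard_normal((4, 8)), rng.standard_normal((4, 8, 16)))
loss = erc_style_loss(M, margin=0.1)
```

Whatever the exact penalty, the loss in this sketch only ever touches the n × n matrix `M`, so its cost depends on the number of experts and not on how many tokens are in the batch, which matches the efficiency argument in the abstract.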