Expert Threshold Routing for Autoregressive Language Modeling with Dynamic Computation Allocation and Load Balancing
March 12, 2026
Authors: Hanchi Sun, Yixin Liu, Yonghui Wu, Lichao Sun
cs.AI
Abstract
Token-choice Mixture-of-Experts (TC-MoE) routes each token to a fixed number of experts, limiting dynamic computation allocation and requiring auxiliary losses to maintain load balance. We propose Expert Threshold (ET) routing, where each expert maintains an exponential moving average (EMA) threshold estimated from the global token distribution. During both training and inference, each token is independently routed to an expert if its score exceeds the expert's threshold, enabling dynamic computation allocation while achieving load balance without auxiliary losses. This fully causal mechanism eliminates dependence on other tokens in the batch, making it well-suited for autoregressive language modeling. In pretraining experiments scaling to 2.4B parameters on FineWeb-Edu, ET achieves 0.067 lower cross-entropy loss than TC-MoE, equivalent to reaching the same performance with 1.6× fewer training tokens.
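The routing idea described in the abstract can be sketched compactly. The snippet below is a minimal NumPy illustration under our own assumptions: we model each expert's EMA threshold as a moving average of the per-batch score quantile corresponding to a target routing rate, which is one plausible way to estimate a threshold "from the global token distribution." The function names, the quantile-based update, and the hyperparameters are ours, not the paper's; the authors' exact update rule may differ.

```python
import numpy as np

def update_thresholds(thresholds, scores, target_rate, momentum=0.99):
    """EMA update of per-expert thresholds (illustrative, not the paper's rule).

    scores: [num_tokens, num_experts] router scores for the current batch.
    For each expert, take the batch quantile such that a `target_rate`
    fraction of tokens would exceed it, then fold it into the EMA.
    """
    batch_quantile = np.quantile(scores, 1.0 - target_rate, axis=0)
    return momentum * thresholds + (1.0 - momentum) * batch_quantile

def route(scores, thresholds):
    """Each token is independently sent to every expert whose threshold
    its score exceeds. At inference the thresholds are frozen constants,
    so routing needs no statistics from other tokens in the batch
    (the fully causal property the abstract highlights)."""
    return scores > thresholds  # boolean mask [num_tokens, num_experts]

# Toy simulation: thresholds converge so that roughly `target_rate`
# of tokens are routed to each expert, with no auxiliary loss.
rng = np.random.default_rng(0)
num_experts, target_rate = 4, 0.25
thresholds = np.zeros(num_experts)
for _ in range(500):
    scores = rng.uniform(size=(1024, num_experts))
    thresholds = update_thresholds(thresholds, scores, target_rate)

scores = rng.uniform(size=(4096, num_experts))
load = route(scores, thresholds).mean(axis=0)  # fraction routed per expert
```

In this toy setup the per-expert load settles near the 25% target, and a token scoring above several thresholds is sent to several experts (dynamic computation allocation), while a token below all of them is sent to none.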