Expert Threshold Routing for Autoregressive Language Modeling with Dynamic Computation Allocation and Load Balancing
March 12, 2026
Authors: Hanchi Sun, Yixin Liu, Yonghui Wu, Lichao Sun
cs.AI
Abstract
Token-choice Mixture-of-Experts (TC-MoE) routes each token to a fixed number of experts, limiting dynamic computation allocation and requiring auxiliary losses to maintain load balance. We propose Expert Threshold (ET) routing, in which each expert maintains an exponential moving average (EMA) threshold estimated from the global token distribution. At both training and inference, each token is independently routed to an expert if its score exceeds that expert's threshold, enabling dynamic computation allocation while achieving load balance without auxiliary losses. This fully causal mechanism eliminates dependence on other tokens in the batch, making it well suited for autoregressive language modeling. In pretraining experiments scaling to 2.4B parameters on FineWeb-Edu, ET achieves 0.067 lower cross-entropy loss than TC-MoE, equivalent to reaching the same performance with 1.6× fewer tokens.
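The routing rule described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the abstract does not specify how the per-expert EMA threshold tracks the global token distribution, so the quantile-based update below (targeting a hypothetical `target_load` fraction of tokens per expert) is an assumption, as are all function and parameter names.

```python
import numpy as np

def et_route(scores, thresholds, ema_decay=0.99, target_load=0.125):
    """Sketch of Expert Threshold (ET) routing for one batch of tokens.

    scores:     (num_tokens, num_experts) router affinity scores.
    thresholds: (num_experts,) per-expert EMA thresholds, updated in place.

    Each token is routed to every expert whose threshold its score
    exceeds, so the number of experts per token is dynamic, and the
    decision for a token depends only on that token's own scores
    (fully causal; no dependence on other tokens in the batch).
    """
    # Independent, per-token routing decision.
    mask = scores > thresholds  # (num_tokens, num_experts) boolean

    # Assumed threshold update: move each expert's threshold toward the
    # (1 - target_load) quantile of the scores it observed, so that on
    # average a target_load fraction of tokens clears the threshold.
    batch_quantile = np.quantile(scores, 1.0 - target_load, axis=0)
    thresholds[:] = ema_decay * thresholds + (1.0 - ema_decay) * batch_quantile
    return mask
```

Because the mask is computed per token against slowly moving thresholds, the same rule applies unchanged at training and inference time, which is the property the abstract highlights for autoregressive decoding.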