Expert Threshold Routing for Autoregressive Language Modeling with Dynamic Computation Allocation and Load Balancing

Abstract

Token-choice Mixture-of-Experts (TC-MoE) routes each token to a fixed number of experts, which limits dynamic computation allocation and requires auxiliary losses to maintain load balance. We propose Expert Threshold (ET) routing, in which each expert maintains an exponential moving average (EMA) threshold estimated from the global token distribution. During both training and inference, each token is routed to an expert independently whenever its score exceeds that expert's threshold, which enables dynamic computation allocation and achieves load balance without auxiliary losses. This fully causal mechanism eliminates dependence on other tokens in the batch, making it well suited to autoregressive language modeling. In pretraining experiments scaling to 2.4B parameters on FineWeb-Edu, ET achieves a cross-entropy loss 0.067 lower than TC-MoE, equivalent to reaching the same performance with 1.6× fewer tokens.
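The routing rule described in the abstract can be illustrated with a small sketch. The abstract does not say how each threshold is estimated, so the code below assumes one plausible scheme: each expert's threshold is an EMA of the per-batch score cutoff that would admit a target fraction of tokens. The class name `ExpertThresholdRouter` and the parameters `target_fraction` and `decay` are hypothetical, chosen for illustration only, and this is not the authors' implementation.

```python
class ExpertThresholdRouter:
    """Illustrative sketch only: one EMA threshold per expert (assumed scheme)."""

    def __init__(self, num_experts, target_fraction, decay=0.99):
        self.num_experts = num_experts
        self.target_fraction = target_fraction  # assumed: share of tokens each expert should take
        self.decay = decay                      # assumed: EMA decay factor
        self.thresholds = [0.0] * num_experts

    def route(self, scores):
        """scores[t][e] is the router score of token t for expert e.

        Each token is routed independently: it goes to every expert whose
        threshold its score exceeds, so a token may get zero, one, or
        several experts, and no other token in the batch is consulted.
        """
        return [
            [e for e, s in enumerate(row) if s > self.thresholds[e]]
            for row in scores
        ]

    def update(self, scores):
        """Move each threshold toward the batch cutoff (training-time step).

        The cutoff is the k-th largest score for that expert, where k is
        the target fraction of the batch (at least one token).
        """
        n = len(scores)
        k = max(1, int(self.target_fraction * n))
        for e in range(self.num_experts):
            column = sorted((row[e] for row in scores), reverse=True)
            cutoff = column[k - 1]
            self.thresholds[e] = (
                self.decay * self.thresholds[e] + (1.0 - self.decay) * cutoff
            )
```

Because `route` reads only the stored thresholds, the decision for a token does not depend on the rest of the batch, which is the property the abstract calls fully causal. Under the assumption made here, load balance comes from `update` steering each expert toward its target share rather than from an auxiliary loss.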

