Expert Threshold Routing for Autoregressive Language Modeling with Dynamic Computation Allocation and Load Balancing

Abstract

Token-choice Mixture-of-Experts (TC-MoE) routes each token to a fixed number of experts, which limits dynamic computation allocation and requires auxiliary losses to maintain load balance. We propose Expert Threshold (ET) routing, in which each expert maintains an exponential moving average (EMA) threshold estimated from the global token distribution. At both training and inference time, each token is independently routed to an expert if its score exceeds that expert's threshold. This enables dynamic computation allocation while achieving load balance without auxiliary losses. The mechanism is fully causal: it eliminates dependence on other tokens in the batch, which makes it well suited to autoregressive language modeling. In pretraining experiments scaling to 2.4B parameters on FineWeb-Edu, ET achieves a cross-entropy loss 0.067 lower than TC-MoE, equivalent to reaching the same performance with 1.6× fewer tokens.
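The routing rule described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes that each threshold is an EMA of a per-batch score cutoff chosen so that a target fraction of tokens exceeds it; the abstract does not specify how the threshold is estimated. The class name `ExpertThresholdRouter` and the parameters `target_fraction` and `decay` are hypothetical, and random Gaussian scores stand in for real router outputs.

```python
import random


class ExpertThresholdRouter:
    """Toy per-expert threshold router (illustrative only, not the paper's code)."""

    def __init__(self, num_experts, target_fraction=0.25, decay=0.9):
        self.target_fraction = target_fraction  # assumed: desired share of tokens per expert
        self.decay = decay                      # assumed: EMA decay factor
        self.thresholds = [0.0] * num_experts

    def route(self, scores):
        """Route ONE token: return every expert whose threshold its score exceeds.

        The decision uses only this token's scores and the stored thresholds,
        so it does not depend on other tokens in the batch.
        """
        return [e for e, s in enumerate(scores) if s > self.thresholds[e]]

    def update(self, batch_scores):
        """Move each threshold toward the cutoff admitting target_fraction of the batch."""
        n = len(batch_scores)  # this sketch needs at least 2 tokens per batch
        k = min(max(1, round(self.target_fraction * n)), n - 1)
        for e in range(len(self.thresholds)):
            col = sorted((row[e] for row in batch_scores), reverse=True)
            # Midpoint between the k-th and (k+1)-th largest scores:
            # exactly k tokens in this batch lie strictly above it (barring ties).
            cutoff = 0.5 * (col[k - 1] + col[k])
            self.thresholds[e] = (
                self.decay * self.thresholds[e] + (1 - self.decay) * cutoff
            )


random.seed(0)
NUM_EXPERTS = 4
router = ExpertThresholdRouter(NUM_EXPERTS)

# "Training": update the thresholds from batches of synthetic scores.
for _ in range(200):
    batch = [[random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)] for _ in range(256)]
    router.update(batch)

# "Inference": route fresh tokens one at a time with the frozen thresholds.
tokens = [[random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)] for _ in range(2000)]
routes = [router.route(t) for t in tokens]
loads = [sum(e in r for r in routes) / len(tokens) for e in range(NUM_EXPERTS)]
experts_per_token = [len(r) for r in routes]

print("per-expert load:", [round(x, 3) for x in loads])
print("experts per token (first 10):", experts_per_token[:10])
```

In this toy setting each expert's load stays close to the target fraction with no auxiliary loss term, while the number of experts chosen per token varies from token to token, which is the dynamic computation allocation the abstract refers to.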
