Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts
August 28, 2024
Authors: Lean Wang, Huazuo Gao, Chenggang Zhao, Xu Sun, Damai Dai
cs.AI
Abstract
For Mixture-of-Experts (MoE) models, an unbalanced expert load will lead to
routing collapse or increased computational overhead. Existing methods commonly
employ an auxiliary loss to encourage load balance, but a large auxiliary loss
will introduce non-negligible interference gradients into training and thus
impair the model performance. In order to control load balance while not
producing undesired gradients during training, we propose Loss-Free Balancing,
which features an auxiliary-loss-free load balancing strategy. Specifically,
before the top-K routing decision, Loss-Free Balancing first applies an
expert-wise bias to the routing score of each expert. By dynamically updating
the bias of each expert according to its recent load, Loss-Free Balancing can
consistently maintain a balanced distribution of expert load. In addition,
since Loss-Free Balancing does not produce any interference gradients, it also
elevates the upper bound of model performance gained from MoE training. We
validate the performance of Loss-Free Balancing on MoE models with up to 3B
parameters trained on up to 200B tokens. Experimental results show that
Loss-Free Balancing achieves both better performance and better load balance
compared with traditional auxiliary-loss-controlled load balancing strategies.
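The following PyTorch sketch illustrates the routing procedure the abstract describes: an expert-wise bias is added to the routing scores only for the top-K expert selection, and the bias is updated from the recently observed expert load outside of autograd, so it contributes no interference gradient. The gating function (sigmoid), the hyperparameter name update_rate, and the sign-based update rule are illustrative assumptions, not details stated in this abstract.

```python
import torch

class LossFreeBalancedRouter(torch.nn.Module):
    """Top-K router with an expert-wise, gradient-free balancing bias."""

    def __init__(self, hidden_dim: int, num_experts: int, top_k: int = 2,
                 update_rate: float = 1e-3):
        super().__init__()
        self.gate = torch.nn.Linear(hidden_dim, num_experts, bias=False)
        self.top_k = top_k
        self.update_rate = update_rate  # hypothetical hyperparameter name
        # The balancing bias lives in a buffer, outside of autograd, so it
        # introduces no interference gradients into training.
        self.register_buffer("expert_bias", torch.zeros(num_experts))

    def forward(self, x: torch.Tensor):
        # Routing scores per (token, expert); sigmoid gating is an assumption.
        scores = torch.sigmoid(self.gate(x))
        # The bias influences WHICH experts are selected ...
        biased_scores = scores + self.expert_bias
        topk_idx = biased_scores.topk(self.top_k, dim=-1).indices
        # ... but the gating weights use the original, unbiased scores.
        topk_weights = scores.gather(-1, topk_idx)

        if self.training:
            with torch.no_grad():
                # Measure the recent load: tokens assigned to each expert.
                load = torch.zeros_like(self.expert_bias)
                load.scatter_add_(
                    0, topk_idx.reshape(-1),
                    torch.ones(topk_idx.numel(), device=x.device))
                # Nudge underloaded experts' biases up and overloaded ones
                # down; a sign-based step is one simple choice of update.
                self.expert_bias += self.update_rate * (load.mean() - load).sign()

        return topk_idx, topk_weights


# Toy usage: route a batch of 16 token embeddings to 2 of 8 experts.
router = LossFreeBalancedRouter(hidden_dim=32, num_experts=8)
indices, weights = router(torch.randn(16, 32))
print(indices.shape, weights.shape)  # torch.Size([16, 2]) torch.Size([16, 2])
```

Because the bias enters only the selection step and never the gating weights or any loss term, backpropagation sees the same computation as an unregularized router, which is what allows the load to be balanced without the interference gradients of an auxiliary loss.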