A Theoretical Framework for Auxiliary-Loss-Free Load Balancing of Sparse Mixture-of-Experts in Large-Scale AI Models
December 3, 2025
Authors: X. Y. Han, Yuan Zhong
cs.AI
Abstract
In large-scale AI training, Sparse Mixture-of-Experts (s-MoE) layers enable scaling by activating only a small subset of experts per token. An operational challenge in this design is load balancing: routing tokens to minimize the number of idle experts, which is important for the efficient utilization of (costly) GPUs. We provide a theoretical framework for analyzing the Auxiliary-Loss-Free Load Balancing (ALF-LB) procedure -- proposed by DeepSeek's Wang et al. (2024) -- by casting it as a one-step-per-iteration primal-dual method for an assignment problem. First, in a stylized deterministic setting, our framework yields several key structural properties: (i) monotonic improvement of a Lagrangian objective, (ii) a preference rule that moves tokens from overloaded to underloaded experts, and (iii) an approximate-balancing guarantee. Then, we incorporate the stochastic and dynamic nature of AI training through a generalized online optimization formulation. In the online setting, we derive a strong convexity property of the objective that leads to a logarithmic expected regret bound under certain step-size choices. Additionally, we present experiments on real 1B-parameter DeepSeekMoE models to complement our theoretical findings. Together, these results build a principled framework for analyzing the Auxiliary-Loss-Free Load Balancing of s-MoE layers in AI models.
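To make the routing procedure concrete, below is a minimal sketch of the auxiliary-loss-free idea the abstract describes: each expert carries a bias term (playing the role of a dual-like variable), top-k selection uses the bias-adjusted scores, and after each batch the bias is nudged up for underloaded experts and down for overloaded ones. The function names, the sign-based update, the step size, and the toy dimensions are illustrative assumptions for this sketch, not the paper's exact formulation or its primal-dual analysis.

```python
import numpy as np

def alf_lb_route(scores, bias, k):
    """Select top-k experts per token using bias-adjusted scores.

    scores: (num_tokens, num_experts) router affinities for one batch.
    bias:   (num_experts,) per-expert bias terms (dual-like variables).
    k:      number of experts activated per token.
    Returns a boolean assignment matrix of shape (num_tokens, num_experts).
    """
    biased = scores + bias  # bias affects which experts are selected
    topk = np.argpartition(-biased, k - 1, axis=1)[:, :k]
    assign = np.zeros_like(scores, dtype=bool)
    np.put_along_axis(assign, topk, True, axis=1)
    return assign

def update_bias(bias, assign, step_size):
    """One sign-based update: raise bias of underloaded experts, lower overloaded ones."""
    load = assign.sum(axis=0)   # tokens routed to each expert this batch
    target = load.mean()        # perfectly balanced per-expert load
    return bias + step_size * np.sign(target - load)

# Toy usage: 64 tokens, 8 experts, top-2 routing, deliberately skewed affinities.
rng = np.random.default_rng(0)
bias = np.zeros(8)
for step in range(100):
    scores = rng.normal(size=(64, 8)) + np.linspace(0.0, 1.0, 8)
    assign = alf_lb_route(scores, bias, k=2)
    bias = update_bias(bias, assign, step_size=0.05)
print("per-expert load after adaptation:", alf_lb_route(scores, bias, 2).sum(axis=0))
```

In this toy run the skewed scores initially concentrate tokens on the last few experts; the sign-based bias updates gradually pull the per-expert loads toward the balanced target without adding any auxiliary loss term to the training objective.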