From Sparse to Soft Mixtures of Experts

August 2, 2023
Authors: Joan Puigcerver, Carlos Riquelme, Basil Mustafa, Neil Houlsby
cs.AI

Abstract

Sparse mixture of expert architectures (MoEs) scale model capacity without large increases in training or inference costs. Despite their success, MoEs suffer from a number of issues: training instability, token dropping, inability to scale the number of experts, or ineffective finetuning. In this work, we propose Soft MoE, a fully-differentiable sparse Transformer that addresses these challenges, while maintaining the benefits of MoEs. Soft MoE performs an implicit soft assignment by passing different weighted combinations of all input tokens to each expert. As in other MoE works, experts in Soft MoE only process a subset of the (combined) tokens, enabling larger model capacity at lower inference cost. In the context of visual recognition, Soft MoE greatly outperforms standard Transformers (ViTs) and popular MoE variants (Tokens Choice and Experts Choice). For example, Soft MoE-Base/16 requires 10.5x lower inference cost (5.7x lower wall-clock time) than ViT-Huge/14 while matching its performance after similar training. Soft MoE also scales well: Soft MoE Huge/14 with 128 experts in 16 MoE layers has over 40x more parameters than ViT Huge/14, while inference time cost grows by only 2%, and it performs substantially better.
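The soft assignment described in the abstract can be made concrete with a small sketch: per-slot logits are computed for every input token, a softmax over the token axis builds each slot as a weighted combination of all tokens, each expert processes only its own slots, and a softmax over the slot axis mixes the slot outputs back into per-token outputs. The code below is an illustrative NumPy sketch, not the authors' reference implementation; the names (`soft_moe_layer`, `slot_params`, `expert_weights`) and the single-matrix ReLU standing in for each expert MLP are assumptions made for brevity.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_moe_layer(tokens, slot_params, expert_weights):
    """Minimal Soft MoE sketch.

    tokens:         (m, d) input token embeddings
    slot_params:    (d, n_slots) learnable per-slot parameters
    expert_weights: list of n_experts matrices, each (d, d);
                    n_slots must be divisible by n_experts
    """
    logits = tokens @ slot_params                      # (m, n_slots)

    # Dispatch: each slot is a convex combination of ALL input tokens
    # (softmax over the token axis), so no token is ever dropped.
    dispatch = softmax(logits, axis=0)                 # (m, n_slots)
    slot_inputs = dispatch.T @ tokens                  # (n_slots, d)

    # Each expert processes only its own group of slots, which is what
    # keeps per-expert compute low even as the number of experts grows.
    n_experts = len(expert_weights)
    per_expert = slot_inputs.shape[0] // n_experts
    slot_outputs = np.concatenate([
        np.maximum(slot_inputs[i * per_expert:(i + 1) * per_expert] @ w, 0.0)
        for i, w in enumerate(expert_weights)
    ], axis=0)                                         # (n_slots, d)

    # Combine: each output token is a convex combination of all slot
    # outputs (softmax over the slot axis).
    combine = softmax(logits, axis=1)                  # (m, n_slots)
    return combine @ slot_outputs                      # (m, d)

# Example usage with illustrative sizes.
rng = np.random.default_rng(0)
m, d, n_experts, slots_per_expert = 16, 32, 4, 1
phi = rng.normal(size=(d, n_experts * slots_per_expert))
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
out = soft_moe_layer(rng.normal(size=(m, d)), phi, experts)  # shape (16, 32)
```

Because both the dispatch and combine weights come from differentiable softmaxes rather than a discrete top-k router, every parameter receives gradients, which is the sense in which the layer is "fully differentiable" while each expert still touches only a small number of (combined) slots.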