

From Sparse to Soft Mixtures of Experts

August 2, 2023
Authors: Joan Puigcerver, Carlos Riquelme, Basil Mustafa, Neil Houlsby
cs.AI

Abstract

Sparse mixture of expert architectures (MoEs) scale model capacity without large increases in training or inference costs. Despite their success, MoEs suffer from a number of issues: training instability, token dropping, inability to scale the number of experts, or ineffective finetuning. In this work, we propose Soft MoE, a fully-differentiable sparse Transformer that addresses these challenges, while maintaining the benefits of MoEs. Soft MoE performs an implicit soft assignment by passing different weighted combinations of all input tokens to each expert. As in other MoE works, experts in Soft MoE only process a subset of the (combined) tokens, enabling larger model capacity at lower inference cost. In the context of visual recognition, Soft MoE greatly outperforms standard Transformers (ViTs) and popular MoE variants (Tokens Choice and Experts Choice). For example, Soft MoE-Base/16 requires 10.5x lower inference cost (5.7x lower wall-clock time) than ViT-Huge/14 while matching its performance after similar training. Soft MoE also scales well: Soft MoE Huge/14 with 128 experts in 16 MoE layers has over 40x more parameters than ViT Huge/14, while inference time cost grows by only 2%, and it performs substantially better.
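Read literally from the abstract, the core operation is a soft routing step: each expert "slot" receives a learned weighted combination of all input tokens, each expert processes only its own slots, and token outputs are rebuilt as weighted combinations of the slot outputs. The sketch below illustrates this idea in JAX under stated assumptions; the parameter name `phi`, the slot count, the MLP expert definition, and all sizes are illustrative choices, not the authors' implementation.

```python
# Minimal Soft MoE sketch: experts see weighted combinations ("slots") of all
# tokens instead of hard-routed individual tokens. Shapes and expert MLPs are
# illustrative assumptions based on the abstract, not the paper's exact code.
import jax
import jax.numpy as jnp


def soft_moe_layer(params, x):
    """x: [m, d] input tokens; returns [m, d] outputs.

    params:
      phi:     [d, n_experts * slots_per_expert] routing parameters (assumed name)
      experts: list of per-expert MLP weights, each {'w1': [d, h], 'w2': [h, d]}
    """
    phi, experts = params["phi"], params["experts"]
    n_slots_total = phi.shape[1]
    slots_per_expert = n_slots_total // len(experts)

    logits = x @ phi                               # [m, n_slots_total]
    dispatch = jax.nn.softmax(logits, axis=0)      # normalize over tokens: how much each token feeds each slot
    combine = jax.nn.softmax(logits, axis=1)       # normalize over slots: how slots mix back into each token

    slot_inputs = dispatch.T @ x                   # [n_slots_total, d] soft combinations of all tokens

    slot_outputs = []
    for i, w in enumerate(experts):
        s = slot_inputs[i * slots_per_expert:(i + 1) * slots_per_expert]
        h = jax.nn.gelu(s @ w["w1"])               # expert i processes only its own slots
        slot_outputs.append(h @ w["w2"])
    slot_outputs = jnp.concatenate(slot_outputs, axis=0)   # [n_slots_total, d]

    return combine @ slot_outputs                  # [m, d] per-token weighted mix of slot outputs


# Example usage with random parameters (purely illustrative sizes).
key = jax.random.PRNGKey(0)
d, h, n_experts, slots_per_expert, m = 64, 128, 4, 2, 16
keys = jax.random.split(key, 2 + 2 * n_experts)
params = {
    "phi": jax.random.normal(keys[0], (d, n_experts * slots_per_expert)) * 0.02,
    "experts": [
        {"w1": jax.random.normal(keys[1 + 2 * i], (d, h)) * 0.02,
         "w2": jax.random.normal(keys[2 + 2 * i], (h, d)) * 0.02}
        for i in range(n_experts)
    ],
}
x = jax.random.normal(keys[-1], (m, d))
y = soft_moe_layer(params, x)   # y.shape == (16, 64)
```

Because the dispatch and combine weights are plain softmaxes rather than hard top-k assignments, the whole layer is differentiable end to end, and the compute per layer is set by the number of slots rather than the number of experts, which is consistent with the abstract's claim that capacity can grow with little change in inference cost.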