

Mixture-of-Supernets: Improving Weight-Sharing Supernet Training with Architecture-Routed Mixture-of-Experts

June 8, 2023
Authors: Ganesh Jawahar, Haichuan Yang, Yunyang Xiong, Zechun Liu, Dilin Wang, Fei Sun, Meng Li, Aasish Pappu, Barlas Oguz, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan, Raghuraman Krishnamoorthi, Vikas Chandra
cs.AI

Abstract

Weight-sharing supernets have become a vital component for performance estimation in state-of-the-art (SOTA) neural architecture search (NAS) frameworks. Although a supernet can directly generate different subnetworks without retraining, there is no guarantee of the quality of these subnetworks because of weight sharing. In NLP tasks such as machine translation and pre-trained language modeling, we observe that, given the same model architecture, there is a large performance gap between the supernet and training from scratch. Hence, the supernet cannot be used directly, and retraining is necessary after finding the optimal architectures. In this work, we propose mixture-of-supernets, a generalized supernet formulation that adopts mixture-of-experts (MoE) to enhance the expressive power of the supernet model with negligible training overhead. In this way, different subnetworks do not share model weights directly, but through an architecture-based routing mechanism. As a result, the model weights of different subnetworks are customized towards their specific architectures, and the weight generation is learned by gradient descent. Compared to existing weight-sharing supernets for NLP, our method can minimize the retraining time, greatly improving training efficiency. In addition, the proposed method achieves SOTA performance in NAS for building fast machine translation models, yielding a better latency-BLEU tradeoff than HAT, the SOTA NAS framework for machine translation. We also achieve SOTA performance in NAS for building memory-efficient, task-agnostic BERT models, outperforming NAS-BERT and AutoDistil across various model sizes.
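To make the architecture-routed weight sharing concrete, below is a minimal PyTorch sketch (not the authors' implementation) of a supernet linear layer whose weight is a learned mixture of expert weights, with mixture coefficients produced by a router conditioned on an encoding of the sampled architecture. The class name `ArchRoutedLinear`, the `arch_encoding` vector, and the choice of two experts are illustrative assumptions.

```python
# Hypothetical sketch of an architecture-routed MoE linear layer for a
# weight-sharing supernet. Each subnetwork's weight is a convex combination
# of expert weights, selected by a router that sees the architecture encoding,
# then sliced to the subnetwork's dimensions (standard weight sharing).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ArchRoutedLinear(nn.Module):
    def __init__(self, max_in: int, max_out: int, arch_dim: int, num_experts: int = 2):
        super().__init__()
        # One full-size weight/bias per expert; subnetworks use slices of them.
        self.expert_weights = nn.Parameter(torch.randn(num_experts, max_out, max_in) * 0.02)
        self.expert_biases = nn.Parameter(torch.zeros(num_experts, max_out))
        # Router maps an architecture encoding to per-expert mixture coefficients.
        self.router = nn.Linear(arch_dim, num_experts)

    def forward(self, x: torch.Tensor, arch_encoding: torch.Tensor,
                in_dim: int, out_dim: int) -> torch.Tensor:
        # Mixture coefficients alpha depend only on the architecture, so the
        # generated weight is customized per subnetwork and trained end-to-end.
        alpha = F.softmax(self.router(arch_encoding), dim=-1)  # (num_experts,)
        weight = torch.einsum("e,eoi->oi", alpha, self.expert_weights)[:out_dim, :in_dim]
        bias = torch.einsum("e,eo->o", alpha, self.expert_biases)[:out_dim]
        return F.linear(x[..., :in_dim], weight, bias)


# Usage: a subnetwork with in_dim=384, out_dim=512 drawn from a 768->1024 supernet layer.
layer = ArchRoutedLinear(max_in=768, max_out=1024, arch_dim=8)
arch_enc = torch.tensor([384.0, 512.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]) / 1024.0  # toy encoding
y = layer(torch.randn(4, 384), arch_enc, in_dim=384, out_dim=512)
print(y.shape)  # torch.Size([4, 512])
```

In this sketch the gating is a soft mixture rather than sparse top-k routing, so every expert contributes to every subnetwork's weight; the actual paper's routing and weight-generation scheme may differ in detail.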