Mixture-of-Supernets: Improving Weight-Sharing Supernet Training with Architecture-Routed Mixture-of-Experts
June 8, 2023
作者: Ganesh Jawahar, Haichuan Yang, Yunyang Xiong, Zechun Liu, Dilin Wang, Fei Sun, Meng Li, Aasish Pappu, Barlas Oguz, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan, Raghuraman Krishnamoorthi, Vikas Chandra
cs.AI
Abstract
The weight-sharing supernet has become a vital component for performance estimation in state-of-the-art (SOTA) neural architecture search (NAS) frameworks. Although a supernet can directly generate different subnetworks without retraining, there is no guarantee of the quality of these subnetworks because of weight sharing. In NLP tasks such as machine translation and pre-trained language modeling, we observe that, given the same model architecture, there is a large performance gap between the supernet and training from scratch. Hence, the supernet cannot be used directly, and retraining is necessary after finding the optimal architectures.
In this work, we propose mixture-of-supernets, a generalized supernet formulation in which mixture-of-experts (MoE) is adopted to enhance the expressive power of the supernet model with negligible training overhead. In this way, different subnetworks do not share the model weights directly but through an architecture-based routing mechanism. As a result, the model weights of different subnetworks are customized towards their specific architectures, and the weight generation is learned by gradient descent. Compared to existing weight-sharing supernets for NLP, our method can minimize the retraining time, greatly improving training efficiency. In addition, the proposed method achieves SOTA performance in NAS for building fast machine translation models, yielding a better latency-BLEU tradeoff than HAT, the state-of-the-art NAS framework for machine translation. We also achieve SOTA performance in NAS for building memory-efficient, task-agnostic BERT models, outperforming NAS-BERT and AutoDistil across various model sizes.
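To make the architecture-routed weight sharing concrete, below is a minimal PyTorch sketch (not the authors' implementation) of one linear layer in such a supernet: it keeps a small set of expert weight matrices, a router maps an encoding of the sampled architecture to mixing coefficients, and the mixed weights are sliced to the subnetwork's dimensions. The class name, the toy architecture encoding, and all hyperparameters are illustrative assumptions.

```python
# Minimal sketch of an architecture-routed MoE linear layer (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ArchRoutedMoELinear(nn.Module):
    """Linear layer whose weights are a convex mix of expert weights,
    with mixing coefficients produced from an architecture encoding."""

    def __init__(self, max_in, max_out, num_experts=2, arch_dim=8):
        super().__init__()
        # E expert weight matrices, each at the largest (supernet) size.
        self.experts = nn.Parameter(torch.randn(num_experts, max_out, max_in) * 0.02)
        # Router maps an architecture encoding to per-expert mixing weights.
        self.router = nn.Linear(arch_dim, num_experts)

    def forward(self, x, arch_enc, in_dim, out_dim):
        # alpha: (E,) convex combination coefficients for this architecture.
        alpha = F.softmax(self.router(arch_enc), dim=-1)
        # Architecture-specific weights: weighted sum of experts, then slice
        # to the sampled subnetwork's dimensions (standard weight sharing).
        w = torch.einsum("e,eoi->oi", alpha, self.experts)[:out_dim, :in_dim]
        return F.linear(x[..., :in_dim], w)


# Usage: route a 48-dim subnetwork through a 64-dim supernet layer.
layer = ArchRoutedMoELinear(max_in=64, max_out=64, num_experts=2, arch_dim=8)
arch_enc = torch.tensor([48.0, 48.0, 0, 0, 0, 0, 0, 0]) / 64.0  # toy encoding
y = layer(torch.randn(4, 64), arch_enc, in_dim=48, out_dim=48)
print(y.shape)  # torch.Size([4, 48])
```

Because the router and the expert weights are both differentiable, the weight generation for each sampled architecture is learned jointly with standard supernet training by gradient descent, which is the property the abstract highlights.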