Composable Function-preserving Expansions for Transformer Architectures
August 11, 2023
Authors: Andrea Gesmundo, Kaitlin Maile
cs.AI
Abstract
Training state-of-the-art neural networks requires a high cost in terms of
compute and time. Model scale is recognized as a critical factor for achieving
and improving the state of the art. Increasing the scale of a neural network
normally requires restarting from scratch and randomly initializing all of the
model's parameters, since the change in architectural parameters does not allow
a straightforward transfer of knowledge from smaller models. In this work, we
propose six composable transformations that incrementally increase the size of
transformer-based neural networks while preserving functionality, allowing the
capacity of the model to be expanded as needed. We prove exact function
preservation under minimal initialization constraints for each transformation.
The proposed methods may enable efficient training pipelines for larger and
more powerful models by progressively expanding the architecture throughout
training.
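
To make the idea concrete, below is a minimal PyTorch sketch of one function-preserving expansion in the spirit of the abstract: widening the hidden dimension of a transformer MLP block. The helper name `widen_mlp`, the two-layer MLP structure, and the specific choice of zero-initializing the new output-projection columns are illustrative assumptions rather than the authors' exact formulation; the check at the end only verifies that the widened block produces the same outputs up to floating-point error.

```python
# Sketch: function-preserving widening of an MLP block's hidden dimension.
# New input-projection rows (and biases) may be initialized arbitrarily; the
# matching output-projection columns are zero-initialized so the new hidden
# units contribute nothing, leaving the block's function unchanged.
import torch
import torch.nn as nn


def widen_mlp(fc_in: nn.Linear, fc_out: nn.Linear, new_hidden: int):
    """Return widened copies of (fc_in, fc_out) whose composition is unchanged."""
    old_hidden, d_model = fc_in.weight.shape
    assert new_hidden >= old_hidden

    new_in = nn.Linear(d_model, new_hidden, bias=fc_in.bias is not None)
    new_out = nn.Linear(new_hidden, fc_out.out_features, bias=fc_out.bias is not None)
    with torch.no_grad():
        # Copy existing parameters; the extra input-projection rows keep their
        # default (random) initialization -- they may be arbitrary.
        new_in.weight[:old_hidden] = fc_in.weight
        if fc_in.bias is not None:
            new_in.bias[:old_hidden] = fc_in.bias
        # Zero the output-projection columns for the new hidden units, so their
        # (arbitrary) activations add nothing to the block's output.
        new_out.weight[:, :old_hidden] = fc_out.weight
        new_out.weight[:, old_hidden:] = 0.0
        if fc_out.bias is not None:
            new_out.bias.copy_(fc_out.bias)
    return new_in, new_out


# Quick check of function preservation (up to floating point).
d_model, d_ff = 16, 64
fc_in, fc_out = nn.Linear(d_model, d_ff), nn.Linear(d_ff, d_model)
x = torch.randn(4, d_model)
y_old = fc_out(torch.relu(fc_in(x)))
w_in, w_out = widen_mlp(fc_in, fc_out, new_hidden=128)
y_new = w_out(torch.relu(w_in(x)))
print(torch.allclose(y_old, y_new, atol=1e-6))  # True
```

Because the preserved behaviour depends only on the zero-initialized output columns, the newly added parameters are free to move during subsequent training, which is what allows the expanded model to use the additional capacity.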