Composable Function-preserving Expansions for Transformer Architectures
August 11, 2023
Authors: Andrea Gesmundo, Kaitlin Maile
cs.AI
Abstract
Training state-of-the-art neural networks requires a high cost in terms of
compute and time. Model scale is recognized as a critical factor for achieving
and improving the state of the art. Increasing the scale of a neural network
normally requires restarting from scratch and randomly initializing all of the
model's parameters, since the change in architectural parameters does not allow
a straightforward transfer of knowledge from smaller models. In this work, we
propose six composable transformations that incrementally increase the size of
transformer-based neural networks while preserving functionality, allowing the
capacity of the model to be expanded as needed. We prove exact function
preservation under minimal initialization constraints for each transformation.
The proposed methods may enable efficient training pipelines for larger and
more powerful models by progressively expanding the architecture throughout
training.
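
To make the idea concrete, below is a minimal PyTorch sketch of one function-preserving expansion in the spirit of the abstract: widening the hidden dimension of a transformer MLP block. The helper name `widen_mlp`, the two-layer MLP structure, and the specific choice of zero-initializing the new output-projection columns are illustrative assumptions rather than the authors' exact formulation; the check at the end only verifies that the widened block produces the same outputs up to floating-point error.

```python
# Sketch: function-preserving widening of an MLP block's hidden dimension.
# New input-projection rows (and biases) may be initialized arbitrarily; the
# matching output-projection columns are zero-initialized so the new hidden
# units contribute nothing, leaving the block's function unchanged.
import torch
import torch.nn as nn


def widen_mlp(fc_in: nn.Linear, fc_out: nn.Linear, new_hidden: int):
    """Return widened copies of (fc_in, fc_out) whose composition is unchanged."""
    old_hidden, d_model = fc_in.weight.shape
    assert new_hidden >= old_hidden

    new_in = nn.Linear(d_model, new_hidden, bias=fc_in.bias is not None)
    new_out = nn.Linear(new_hidden, fc_out.out_features, bias=fc_out.bias is not None)
    with torch.no_grad():
        # Copy existing parameters; the extra input-projection rows keep their
        # default (random) initialization -- they may be arbitrary.
        new_in.weight[:old_hidden] = fc_in.weight
        if fc_in.bias is not None:
            new_in.bias[:old_hidden] = fc_in.bias
        # Zero the output-projection columns for the new hidden units, so their
        # (arbitrary) activations add nothing to the block's output.
        new_out.weight[:, :old_hidden] = fc_out.weight
        new_out.weight[:, old_hidden:] = 0.0
        if fc_out.bias is not None:
            new_out.bias.copy_(fc_out.bias)
    return new_in, new_out


# Quick check of function preservation (up to floating point).
d_model, d_ff = 16, 64
fc_in, fc_out = nn.Linear(d_model, d_ff), nn.Linear(d_ff, d_model)
x = torch.randn(4, d_model)
y_old = fc_out(torch.relu(fc_in(x)))
w_in, w_out = widen_mlp(fc_in, fc_out, new_hidden=128)
y_new = w_out(torch.relu(w_in(x)))
print(torch.allclose(y_old, y_new, atol=1e-6))  # True
```

Because the preserved behaviour depends only on the zero-initialized output columns, the newly added parameters are free to move during subsequent training, which is what allows the expanded model to use the additional capacity.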