Composable Function-preserving Expansions for Transformer Architectures
August 11, 2023
Authors: Andrea Gesmundo, Kaitlin Maile
cs.AI
Abstract
Training state-of-the-art neural networks requires a high cost in compute and time. Model scale is recognized as a critical factor for achieving and improving the state of the art. Increasing the scale of a neural network normally requires restarting from scratch and randomly initializing all of the model's parameters, since the change in architectural parameters does not allow a straightforward transfer of knowledge from smaller-size models. In this work, we propose six composable transformations that incrementally increase the size of transformer-based neural networks while preserving functionality, allowing the capacity of the model to be expanded as needed. For each transformation, we provide a proof of exact function preservation under minimal initialization constraints. The proposed methods may enable efficient training pipelines for larger and more powerful models by progressively expanding the architecture throughout training.
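
To make "exact function preservation under minimal initialization constraints" concrete, here is a minimal illustrative sketch, not the authors' exact construction, of one such expansion: widening an MLP hidden layer while zero-initializing the new output-projection columns, so the expanded network computes exactly the same function as the original. All function and variable names here are assumptions made for illustration.

```python
import numpy as np

def expand_mlp_hidden(W_in, b_in, W_out, new_hidden, rng):
    """Widen a 2-layer ReLU MLP's hidden dimension from h to new_hidden.

    New hidden units get random incoming weights but zero outgoing weights,
    so their contribution to the output is zero and the function is exactly
    preserved. (Illustrative sketch only; the paper states its own constraints.)
    """
    h, d_in = W_in.shape
    d_out, _ = W_out.shape
    assert new_hidden >= h

    W_in_new = np.vstack([W_in, 0.02 * rng.normal(size=(new_hidden - h, d_in))])
    b_in_new = np.concatenate([b_in, np.zeros(new_hidden - h)])
    W_out_new = np.hstack([W_out, np.zeros((d_out, new_hidden - h))])
    return W_in_new, b_in_new, W_out_new

def mlp(x, W_in, b_in, W_out):
    # Two-layer MLP with ReLU activation.
    return W_out @ np.maximum(W_in @ x + b_in, 0.0)

# Check exact function preservation on a random input.
rng = np.random.default_rng(0)
W_in = rng.normal(size=(8, 4))
b_in = rng.normal(size=8)
W_out = rng.normal(size=(3, 8))
x = rng.normal(size=4)

W_in2, b_in2, W_out2 = expand_mlp_hidden(W_in, b_in, W_out, 16, rng)
assert np.allclose(mlp(x, W_in, b_in, W_out), mlp(x, W_in2, b_in2, W_out2))
```

Because the new output-projection columns are zero, the randomly initialized new hidden units are free to break symmetry during subsequent training without altering the network's current input-output behavior, which is the general idea behind function-preserving expansion.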