Composable Function-preserving Expansions for Transformer Architectures
August 11, 2023
Authors: Andrea Gesmundo, Kaitlin Maile
cs.AI
Abstract
Training state-of-the-art neural networks requires a high cost in compute and time. Model scale is recognized as a critical factor for achieving and improving the state of the art. Increasing the scale of a neural network normally requires restarting from scratch and randomly initializing all of the model's parameters, since the change in architectural parameters does not allow a straightforward transfer of knowledge from smaller-size models. In this work, we propose six composable transformations that incrementally increase the size of transformer-based neural networks while preserving functionality, allowing the capacity of the model to be expanded as needed. For each transformation, we provide a proof of exact function preservation under minimal initialization constraints. The proposed methods may enable efficient training pipelines for larger and more powerful models by progressively expanding the architecture throughout training.
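
To make "exact function preservation under minimal initialization constraints" concrete, here is a minimal illustrative sketch, not the authors' exact construction, of one such expansion: widening an MLP hidden layer while zero-initializing the new output-projection columns, so the expanded network computes exactly the same function as the original. All function and variable names here are assumptions made for illustration.

```python
import numpy as np

def expand_mlp_hidden(W_in, b_in, W_out, new_hidden, rng):
    """Widen a 2-layer ReLU MLP's hidden dimension from h to new_hidden.

    New hidden units get random incoming weights but zero outgoing weights,
    so their contribution to the output is zero and the function is exactly
    preserved. (Illustrative sketch only; the paper states its own constraints.)
    """
    h, d_in = W_in.shape
    d_out, _ = W_out.shape
    assert new_hidden >= h

    W_in_new = np.vstack([W_in, 0.02 * rng.normal(size=(new_hidden - h, d_in))])
    b_in_new = np.concatenate([b_in, np.zeros(new_hidden - h)])
    W_out_new = np.hstack([W_out, np.zeros((d_out, new_hidden - h))])
    return W_in_new, b_in_new, W_out_new

def mlp(x, W_in, b_in, W_out):
    # Two-layer MLP with ReLU activation.
    return W_out @ np.maximum(W_in @ x + b_in, 0.0)

# Check exact function preservation on a random input.
rng = np.random.default_rng(0)
W_in = rng.normal(size=(8, 4))
b_in = rng.normal(size=8)
W_out = rng.normal(size=(3, 8))
x = rng.normal(size=4)

W_in2, b_in2, W_out2 = expand_mlp_hidden(W_in, b_in, W_out, 16, rng)
assert np.allclose(mlp(x, W_in, b_in, W_out), mlp(x, W_in2, b_in2, W_out2))
```

Because the new output-projection columns are zero, the randomly initialized new hidden units are free to break symmetry during subsequent training without altering the network's current input-output behavior, which is the general idea behind function-preserving expansion.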